CN114518850B - Secure deduplicated storage system based on trusted execution protection with deduplication-before-encryption - Google Patents

Secure deduplicated storage system based on trusted execution protection with deduplication-before-encryption

Info

Publication number
CN114518850B
CN114518850B (application CN202210169874.4A)
Authority
CN
China
Prior art keywords
data
enclave
data blocks
data block
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210169874.4A
Other languages
Chinese (zh)
Other versions
CN114518850A (en
Inventor
杨祚儒
李经纬
李柏晴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunlianwang Technology Guangdong Co ltd
Original Assignee
Yunlianwang Technology Guangdong Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunlianwang Technology Guangdong Co ltd filed Critical Yunlianwang Technology Guangdong Co ltd
Priority to CN202210169874.4A priority Critical patent/CN114518850B/en
Publication of CN114518850A publication Critical patent/CN114518850A/en
Application granted granted Critical
Publication of CN114518850B publication Critical patent/CN114518850B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2255Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/602Providing cryptographic facilities or services
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • G06F3/0641De-duplication techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/0643Management of files

Abstract

The invention belongs to the technical field of large-scale data management and provides a secure deduplicated storage system based on trusted execution protection that encrypts data after deduplication. The system comprises clients, a data channel, a control channel and a cloud server. Each client is connected to the cloud server through the data channel and the control channel and uploads its plaintext data blocks to an enclave in the cloud through the data channel. The cloud server maintains a global fingerprint index to track the data blocks stored by all clients, removes duplicate data blocks inside the enclave, encrypts the non-duplicate plaintext data blocks, and finally stores the ciphertext data blocks in a storage pool. The data channel transmits the plaintext data blocks sent by the clients, and the control channel transmits storage-related operation commands. The system effectively improves storage efficiency and optimizes performance; its structure is simple and it greatly reduces system overhead.

Description

Secure deduplicated storage system based on trusted execution protection with deduplication-before-encryption
Technical Field
The invention belongs to the technical field of large-scale data management, and particularly relates to a secure deduplicated storage system based on trusted execution protection that performs encryption after deduplication.
Background
In the face of rapidly growing data volumes, storing data on public cloud services provides a viable, low-overhead solution for large-scale data management [1]. To prevent data privacy leakage, customers often require end-to-end encryption, so that their data is encrypted before being stored in an untrusted public cloud [2]. However, with conventional symmetric encryption each user encrypts its own data with a different key, so identical data from different users yields different ciphertexts and cross-user deduplication is no longer possible.
There is a rich literature on how to seamlessly integrate encryption and data deduplication in a secure deduplicated storage system [3]-[7]; we refer to these approaches collectively as deduplication-after-encryption (DaE). In DaE, data is encrypted at the client to ensure confidentiality, and the cloud then deduplicates the encrypted data across users, removing duplicate ciphertext to save storage. To keep identical content identical after encryption, DaE encrypts each data block with a symmetric key derived from its content, so that duplicate original data blocks (called plaintext data blocks) are always encrypted into the same encrypted block (called a ciphertext data block); identical ciphertext data blocks are then removed by deduplication.
While DaE is popular, we argue that it has fundamental drawbacks in key management overhead, incompatibility with compression, and security (see 2.1). Because DaE always derives a key for each data block before deduplication, it generates a large number of keys even for data blocks that are later removed as duplicates [8], and it incurs extra key storage overhead to manage the keys of all data blocks. The non-duplicate ciphertext data blocks stored by DaE are also difficult to compress further, because their content appears completely random. Furthermore, DaE relies on deterministic encryption to preserve the deduplicability of ciphertext data blocks, and such deterministic encryption is susceptible to frequency analysis and hence to information leakage [9]-[10].
The limitations of DaE motivate us to explore a simple but unexplored design paradigm, which we call deduplication-before-encryption (DbE). DbE first deduplicates the plaintext data blocks and then encrypts the remaining non-duplicate plaintext data blocks with a key that is independent of the block content. DbE differs from DaE mainly in that it does not need to manage a per-block key for encryption or decryption, which removes the limitations of DaE. However, a main reason DbE has not been explored in secure deduplicated storage systems is that the plaintext data blocks are no longer protected by encryption at the time of deduplication.
Our key insight is that the deduplication step of DbE can be protected by trusted execution techniques [11]-[12]. We therefore propose DEBE, a deduplication system based on DbE and protected by trusted execution. DEBE is built on Intel SGX [13], which provides a trusted execution environment called an enclave in which deduplication can be performed securely. One key challenge of implementing DEBE in SGX is the limited enclave space (e.g., 128 MiB [14]). We therefore propose frequency-based deduplication, a two-phase deduplication design that achieves secure and lightweight deduplication within a space-constrained enclave. Specifically, DEBE first deduplicates the frequently occurring data blocks inside the enclave, based on our observation that frequently occurring data blocks typically account for a large fraction of the duplicate data (see 4.1). It then deduplicates the remaining, infrequently occurring data blocks outside the enclave. This frequency-based design has the following key advantages: (1) high performance, since most duplicates are removed in the first phase, which reduces the context-switch overhead incurred when deduplicating data outside the enclave [14]; (2) high storage efficiency, achieved by combining deduplication with compression; and (3) high security, since the frequently occurring data blocks, which are the most vulnerable to frequency-analysis attacks, are deduplicated inside the enclave and their frequencies are thus protected [10].
We implemented a DEBE prototype and evaluated it in a local-area-network setting. Compared with the currently prevailing DaE approaches, DEBE achieves significant performance improvements (e.g., 9.83x and 13.44x over DupLESS [15] when uploading non-duplicate and duplicate data, respectively) and also reduces information leakage without sacrificing storage efficiency (e.g., an 86.8% lower relative entropy than TED [10], which additionally requires extra storage overhead).
Deduplication is a data-reduction technique widely deployed in modern storage systems [16]-[18]. We focus on chunk-based deduplication, which deduplicates at the granularity of data blocks. Specifically, a deduplicated storage system first divides an input file into data blocks of (possibly) different sizes. It identifies each data block by computing a cryptographic hash (e.g., SHA-256) of its content, called the block fingerprint. It maintains a key-value index, called the fingerprint index, to track the fingerprints of all stored data blocks, and stores only non-duplicate data blocks. For each file it also stores a manifest, called the file allocation table, which records information about all data blocks of the file so that the file can be reconstructed later. In addition, the deduplication system compresses the non-duplicate data blocks to remove byte-level redundancy and save further storage space [19].
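For illustration only, the following C++ sketch shows the basic chunk-based deduplication bookkeeping described above (fingerprint computed with SHA-256 via OpenSSL, fingerprint index as an in-memory map). All type and function names are ours, chunking and persistent storage are omitted, and this is not the claimed implementation:

#include <openssl/sha.h>
#include <array>
#include <cstdint>
#include <map>
#include <vector>

using Fingerprint = std::array<uint8_t, SHA256_DIGEST_LENGTH>;

struct BlockAddress { uint64_t container_id; uint32_t offset; uint32_t size; };

class FingerprintIndex {
public:
    // Returns true if the block is a duplicate (filling *existing);
    // otherwise records the new block's address and returns false.
    bool lookup_or_insert(const Fingerprint& fp, const BlockAddress& addr,
                          BlockAddress* existing) {
        auto it = index_.find(fp);
        if (it != index_.end()) { *existing = it->second; return true; }
        index_[fp] = addr;
        return false;
    }
private:
    std::map<Fingerprint, BlockAddress> index_;
};

// The fingerprint is simply the SHA-256 digest of the chunk content.
Fingerprint compute_fingerprint(const std::vector<uint8_t>& chunk) {
    Fingerprint fp;
    SHA256(chunk.data(), chunk.size(), fp.data());
    return fp;
}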
Deduplication-after-encryption (DaE) seamlessly combines deduplication and encryption to achieve both data confidentiality and storage savings. In DaE, the client first encrypts its plaintext data blocks and uploads the ciphertext data blocks to the cloud, which then deduplicates the ciphertext data blocks. A popular encryption scheme in DaE is Message-Locked Encryption (MLE) [4], in which the key used to encrypt and decrypt a data block is derived from the content of the block, so that identical plaintext data blocks are always encrypted into identical ciphertext data blocks that can be deduplicated. One instantiation of MLE is Convergent Encryption (CE) [6], which derives the key of each data block from the block's fingerprint.
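The following small C++ sketch illustrates CE-style key derivation only as background on the DaE baseline that the invention argues against (DEBE itself deliberately avoids per-block keys). The exact key-derivation and encryption details of CE deployments vary; the construction below is a hedged illustration, not a specification:

#include <openssl/sha.h>
#include <array>
#include <cstdint>
#include <vector>

using Key256 = std::array<uint8_t, 32>;

// Convergent key: derived solely from block content, here by hashing the
// SHA-256 fingerprint once more, so identical plaintext blocks yield the same key.
Key256 convergent_key(const std::vector<uint8_t>& plaintext_block) {
    std::array<uint8_t, 32> fingerprint;
    SHA256(plaintext_block.data(), plaintext_block.size(), fingerprint.data());
    Key256 key;
    SHA256(fingerprint.data(), fingerprint.size(), key.data());
    return key;
}
// The block would then be encrypted deterministically with this key (e.g.,
// AES-256 with an IV derived from the fingerprint), so that duplicate
// plaintext blocks map to duplicate ciphertext blocks and remain deduplicable.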
CE is vulnerable to offline brute-force attacks [15]: an attacker can enumerate candidate plaintext data blocks, derive the corresponding keys, and attempt to decrypt a ciphertext data block; a successful decryption reveals the original plaintext block. DupLESS defends against offline brute-force attacks on CE with server-aided key management: a dedicated key server generates the key of each data block from a global secret (known only to the key server) and the block's fingerprint. DupLESS further uses an oblivious pseudorandom function (OPRF) [20] for key generation, so that the key server learns neither the data block nor its key during key generation. It also rate-limits the key-generation requests of each client, to prevent a malicious client from brute-forcing key-generation requests for arbitrary plaintext data blocks.
Limitations of DaE. DaE is the design paradigm of today's mainstream secure deduplicated storage systems. We argue, however, that DaE has inherent limitations in three respects.
Limitation 1: high key management overhead. DaE derives one key per data block, which requires maintaining the keys of all data blocks and leads to a large key storage overhead. In addition, each client needs to encrypt the keys of its own data blocks with its master key. The key storage overhead therefore grows with the number of data blocks and clients, and its impact is larger for highly redundant workloads [21], which store only a small amount of non-duplicate data after deduplication. Moreover, DupLESS generates a key for every data block before uploading it to the cloud, even though duplicate data blocks are later removed by the cloud's deduplication. Because DupLESS uses an OPRF and rate limiting during key generation, key generation itself has a large performance overhead. In short, DaE incurs high key management overhead in both key storage and key generation.
Limitation 2: incompatibility with compression. In DaE, the content of the encrypted non-duplicate data blocks is essentially random, so the cloud cannot compress them to save additional storage space. Although a client could compress plaintext data blocks before encryption and upload the encrypted compressed blocks, this leaks the lengths of the compressed blocks and introduces additional security risks [22].
Limitation 3: security risks. The server-aided key management of DupLESS makes the key server a single point of attack: an adversary that compromises the key server and obtains the global secret can infer block keys via offline brute force, just as in CE. Furthermore, DaE is deterministic by nature, creating a one-to-one mapping between plaintext data blocks and ciphertext data blocks. An attacker can therefore mount frequency analysis on the frequency distribution of the ciphertext data blocks in the deduplicated storage system to infer the original plaintext data blocks [9].
Disclosure of Invention
The object of the invention is to provide a secure deduplicate-then-encrypt storage system based on trusted execution protection, so as to solve the above technical problems.
The invention is realized as follows: a secure deduplicated storage system based on trusted execution protection, which encrypts data after deduplication, comprises clients, a data channel, a control channel and a cloud server. Each client is connected to the cloud server through the data channel and the control channel and uploads its plaintext data blocks to an enclave in the cloud through the data channel. The cloud server maintains a global fingerprint index to track the data blocks stored by all clients, removes duplicate data blocks inside the enclave, encrypts the non-duplicate plaintext data blocks, and finally stores the ciphertext data blocks in a storage pool. The data channel transmits the plaintext data blocks sent by the clients, and the control channel transmits storage-related operation commands.
In a further technical scheme of the invention, the cloud server is provided with an enclave, a storage pool and a complete index module; the enclave is communicatively connected with the complete index module, and the output of the enclave is connected to the input of the storage pool. The enclave performs deduplication, guarantees the confidentiality of the plaintext data blocks during deduplication, compresses the non-duplicate plaintext data blocks, and encrypts the compressed data blocks. The storage pool stores the ciphertext data blocks produced by the enclave, and the complete index module tracks the fingerprints of all non-duplicate data blocks.
In a further technical scheme of the invention, the enclave deployed in the cloud server comprises a frequency tracking unit, a frequency-based deduplication unit, a key management unit, a compression unit and an encryption unit. The output of the frequency tracking unit is connected to the input of the frequency-based deduplication unit; the output of the key management unit is connected to the inputs of the frequency-based deduplication unit and of the encryption unit; the output of the frequency-based deduplication unit is connected to the input of the compression unit; and the output of the compression unit is connected to the input of the encryption unit.
In a further technical scheme of the invention, the key management unit manages a data key, a query key and session keys. The data key is used to encrypt and decrypt the compressed non-duplicate plaintext data blocks in secure storage. The query key protects the plaintext data block information when the complete index outside the enclave is queried. Each client maintains a data channel with the enclave for secure data communication; each data channel is protected by a short-term session key that remains valid for a single communication session.
In a further technical scheme of the invention, the frequency tracking unit tracks the frequency of plaintext data blocks inside the enclave so as to distinguish frequently occurring data blocks from infrequently occurring ones, thereby enabling frequency-based deduplication.
In a further technical scheme of the invention, the frequency-based deduplication unit splits deduplication into two phases according to block frequency and removes all duplicate plaintext data blocks. It comprises a first-phase deduplication unit and a second-phase deduplication unit. The first-phase deduplication unit maintains a small fingerprint index inside the enclave and deduplicates the k most frequently occurring data blocks. The second-phase deduplication unit handles the duplicate data blocks not removed in the first phase, namely infrequently occurring data blocks and newly frequent data blocks.
In a further technical scheme of the invention, the first-phase deduplication takes the fingerprint of a plaintext data block as input and obtains the block's current estimated frequency from the CM-Sketch. It then checks the root node of the min-heap: if the current estimated frequency is smaller than the frequency of the root node, the enclave skips the hash-table lookup and proceeds directly to the second-phase deduplication; if the current estimated frequency reaches the frequency of the root node, the enclave further queries the hash table with the block's fingerprint.
In a further technical scheme of the invention, the second-phase deduplication encrypts the fingerprint of the plaintext data block with the query key and queries the complete index outside the enclave through an OCall with the encrypted fingerprint. If the encrypted fingerprint is found in the complete index, the OCall returns the encrypted block information, which is decrypted inside the enclave with the query key, and the enclave updates the block's address and compressed size into the file allocation table. If the encrypted fingerprint is new to the complete index, the enclave treats the block as a non-duplicate data block, assigns it an address, compresses it, and records its compressed size.
In a further technical scheme of the invention, the enclave compresses the non-duplicate plaintext data blocks after deduplication, encrypts the compressed blocks into ciphertext data blocks, and writes the ciphertext data blocks into a container buffer inside the enclave. When the buffer is full, its content is made immutable and released to the cloud server for persistent storage.
In a further technical scheme of the invention, the enclave creates a file allocation table for each newly uploaded file, and each entry of the file allocation table records the address of a data block and its compressed size. When the enclave updates the file allocation table, it does not need to compress duplicate data blocks again to obtain their compressed sizes, since the compressed sizes are stored in the top-k index and the complete index.
The beneficial effects of the invention are as follows: the system effectively improves storage efficiency and optimizes performance; its structure is simple and it greatly reduces system overhead. Compared with the currently mainstream methods, DEBE achieves significant performance improvements: it is 9.83 times and 13.44 times faster than DupLESS when uploading non-duplicate and duplicate data, respectively, and it reduces information leakage without reducing storage efficiency, with a relative entropy 86.8% lower than that of TED, which itself requires extra storage overhead.
Drawings
FIG. 1 is a schematic diagram of a DEBE architecture according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a relationship between a data repetition rate and a data block frequency according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of an architecture of an enclave provided by an embodiment of the present invention.
Fig. 4 is a schematic diagram of a frequency tracking module in an enclave according to an embodiment of the present invention.
FIG. 5 is a schematic diagram of a top-k index design according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of the overall performance provided by an embodiment of the present invention (experiment one).
Fig. 7 is a schematic diagram of multi-client performance (experiment three) provided by an embodiment of the present invention.
Fig. 8 is a schematic diagram showing the effect of the frequency distribution of the data block on the performance according to the embodiment of the present invention (experiment four).
Fig. 9 is a schematic diagram comparing different deduplication approaches according to an embodiment of the present invention (experiment five).
Fig. 10 is a schematic diagram of uploading and downloading performance on a real dataset (experiment six) provided by an embodiment of the present invention.
Fig. 11 is a schematic diagram comparing the storage efficiency of different deduplication methods according to an embodiment of the present invention (experiment seven).
Fig. 12 is a schematic diagram of the security of different methods against frequency analysis according to an embodiment of the present invention (experiment eight).
Detailed Description
Referring to Figs. 1-12, the secure deduplicated storage system based on trusted execution protection provided by the invention is described in detail as follows:
Turning to "deduplicate before encrypt"
Given the limitations of DaE, we study an unexplored design paradigm for secure deduplicated storage: deduplication-before-encryption (DbE). The main idea is to first deduplicate the plaintext data blocks, removing the duplicates, and then encrypt the non-duplicate plaintext data blocks into ciphertext data blocks for storage.
DbE has several natural advantages over DaE. First, because DbE deduplicates plaintext data blocks first, it can encrypt each non-duplicate plaintext data block with a content-independent key, as in conventional symmetric encryption (see 1), without affecting deduplication. This avoids generating and storing a key derived from the content of every data block and thus reduces key management overhead (addressing Limitation 1 of DaE). Second, DbE can apply compression to the non-duplicate data after deduplication, saving further storage, and then encrypt the compressed non-duplicate data blocks (addressing Limitation 2). Finally, since DbE encrypts with keys independent of block content, it no longer needs to generate a key per data block or to rely on a key server as DupLESS does, which removes the key server as a single point of attack (addressing Limitation 3).
However, a major challenge of realizing DbE is deciding where to perform deduplication, given that the data is no longer protected by encryption at that point: at the client or in the cloud. We consider three options:
1) Each client maintains a local fingerprint index for its own plaintext data blocks, encrypts the non-duplicate plaintext data blocks, and uploads the resulting ciphertext data blocks to the cloud. However, this approach cannot deduplicate data across users.
2) A global fingerprint index is maintained in the cloud to track the data blocks stored by all clients. Each client first submits the fingerprints of its plaintext data blocks to the cloud to query whether they can be deduplicated, then encrypts the non-duplicate data blocks and uploads the ciphertext data blocks. This approach, known as source-based deduplication [23], is vulnerable to side-channel attacks [23]-[24]: a malicious client can infer whether other clients have stored a target data block simply by querying whether that block can be deduplicated.
3) Each client uploads all of its data blocks to the cloud, and the cloud deduplicates them against a global fingerprint index of the data blocks stored by all clients. This approach, known as target-based deduplication [23], hides the deduplication pattern from clients and resists side-channel attacks. However, each client inevitably exposes its plaintext data blocks to the cloud.
For these reasons, DbE has not been widely explored in the literature, and existing research focuses mainly on secure deduplicated storage systems based on DaE.
Intel SGX
In this work we realize DbE through target-based deduplication and show how to protect DbE with trusted execution hardware. We use Intel SGX [13] for trusted execution, although our design could also be combined with other trusted execution hardware (e.g., ARM TrustZone [25] and AMD SEV [26]).
SGX basics: SGX is a set of extended instructions for Intel CPUs that realizes a trusted execution environment in an encrypted and integrity-protected memory region called the Enclave Page Cache (EPC). It guarantees the confidentiality and integrity of data inside the enclave through hardware protection, and provides two interfaces for interacting with untrusted applications outside the enclave: 1) enclave calls (ECalls), which allow an application to securely invoke functionality inside the enclave, and 2) outside calls (OCalls), which allow code inside the enclave to call functions of the outside application.
Challenge: implementation of DbE in SGX is not an easy matter due to the resource limitations of enclaves. First, EPC size is limited (e.g., up to 128 mibs [14] ). When the memory usage of the enclave exceeds the size of the EPC, it encrypts and stores unused memory pages in unprotected main memory and decrypts and verifies the integrity of the evicted pages when they are loaded back into the EPC. This can result in expensive EPC page-changing overhead [27] . Although recent SGX designs support EPC sizes up to 1TiB, they provide only weak security guarantees and have not been widely deployed due to their lack of integrity protection [28] . Second, ECalls and OCalls involve expensive hardware operations (e.g., flushing TLB entries), which can result in significant context switch overhead (e.g., about 8000 CPU cycles per call) [27] ). Therefore, we must address the limited EPC size limitations and context switch overhead of the enclave when designing DbE. Clients: [ computer ]Client, data channel: data channel, control channel: control channel, cloud: cloud, enclave, storage pool.
We design DEBE, a deduplicated storage system protected by trusted execution on Intel SGX [13], as a concrete realization of DbE.
Fig. 1 shows the architecture of DEBE. DEBE does not require a dedicated key server as DupLESS does. We consider a scenario in which multiple clients store data on a cloud server (or simply the cloud). DEBE performs target-based deduplication to remove the duplicate data of multiple clients, so each DEBE client uploads all of its data to the cloud for deduplication. Although a client could first deduplicate its own data locally to save upload bandwidth without introducing side-channel attacks [23], our current design does not adopt this assumption.
To prevent the cloud server from accessing any plaintext data block during deduplication, DEBE deploys an enclave on the cloud server and performs deduplication inside the enclave. Initially, each client establishes two secure communication channels with the cloud: 1) a control channel with the cloud server for transmitting storage-related operation commands, and 2) a data channel with the enclave for transmitting the plaintext data blocks originating from the client. We use conventional SSL/TLS to protect the control channel from tampering, while the data channel is established via a Diffie-Hellman key exchange.
When uploading a file to the cloud, the client first chunks the file data into plaintext data blocks of fixed or variable size. It then sends an upload request to the cloud over the control channel and sends all plaintext data blocks to the enclave over the data channel. The enclave deduplicates and compresses the received plaintext data blocks, encrypts the remaining plaintext data blocks into ciphertext data blocks, and finally stores the ciphertext data blocks together with the file allocation table in the back-end storage pool.
To download a file, the client first sends a download request to the cloud over the control channel. The enclave then retrieves the corresponding file allocation table and the corresponding ciphertext data blocks. Finally, the enclave decrypts the ciphertext data blocks, decompresses them, and returns the plaintext data blocks to the client over the data channel.
We consider an honest-but-curious attacker that does not modify the system protocol but aims to break data confidentiality by identifying the original plaintext content of the ciphertext data blocks stored in the cloud. The attacker can compromise the cloud server and access any data in unprotected main memory as well as the ciphertext data blocks in the storage pool. It can also eavesdrop on the content of OCalls issued to unprotected main memory (e.g., OCall parameters and untrusted function calls).
Our threat model has the following assumptions.
1) If an attacker controls a compromised client, it can access all plaintext data blocks owned by that client. However, since DEBE performs target-based deduplication, the attacker cannot access or infer the plaintext data blocks of other, uncompromised clients.
2) The enclave is trustworthy and reliable; its authenticity can be verified by remote attestation at first launch [13]. Denial-of-service and side-channel attacks against SGX can be mitigated by existing solutions.
3) The cloud server supports remote auditing of data integrity and can be extended to multi-cloud storage for fault tolerance; our implementation, however, does not consider data integrity and fault tolerance.
DEBE aims to let multiple clients securely outsource the management of their stored data to public cloud storage services. It mainly targets storage workloads with highly redundant content (e.g., periodic backups and file-system snapshots), whose storage overhead can be effectively reduced by deduplication and compression. It pursues the following design goals:
1) High performance. DEBE's key management overhead is significantly lower than that of DaE. Given the enclave's limited EPC size and expensive context switches, DEBE also incurs only limited overhead in SGX.
2) High storage savings. DEBE applies deduplication to remove duplicate data blocks across multiple users. It supports exact deduplication, meaning that it removes all duplicate data blocks, and it compresses the non-duplicate data blocks after deduplication to save further storage space.
3) Confidentiality. DEBE retains the security of DaE. Although DEBE does not maintain a dedicated key server for key generation as DupLESS does, it can still resist offline brute-force attacks.
4) Stronger robustness than DaE. By eliminating the key server, DEBE mitigates the single-point-of-attack problem present in DaE. It also mitigates the information leakage of DaE caused by frequency analysis.
The core of DEBE is to perform deduplication inside an enclave deployed on the cloud server while guaranteeing the confidentiality of plaintext data blocks during deduplication. To this end, we propose a deduplication approach based on block frequency that supports secure deduplication under the enclave's resource constraints.
Naive approach: to identify all duplicate data blocks inside the enclave, a simple approach is to maintain the complete fingerprint index (or simply the complete index) inside the enclave, tracking the fingerprints of all non-duplicate data blocks. Deduplication is then performed by checking the complete index inside the enclave, compressing the non-duplicate data blocks, encrypting the compressed blocks, and storing them. However, for large-scale outsourced storage the size of the complete index grows linearly with the number of non-duplicate data blocks; keeping it inside the enclave can push the enclave's memory usage beyond the limited EPC size and incur significant EPC paging overhead.
Another simple design manages the complete index outside the enclave: the enclave issues OCalls that take block fingerprints as input and query the complete index outside the enclave to check whether a data block has already been stored. This saves EPC usage, but triggers a large number of OCalls to query the complete index and thus expensive context switches. In addition, an attacker on the cloud server can monitor the OCall information submitted by the enclave and infer information about the stored ciphertext data blocks from the block fingerprints (e.g., via frequency analysis).
Our approach: we propose frequency-based deduplication, which supports secure deduplication in a resource-limited enclave. Our key observation is that block frequencies (i.e., how often data blocks recur) are highly skewed in real backup workloads, so a small fraction of the data blocks produces a large fraction of the duplicates. To support this observation we analyzed five real backup datasets and measured the duplicate ratio of a given set of data blocks, defined as the ratio between the total size of the duplicate data blocks generated by that set of blocks and the total amount of duplicate data in the entire dataset. Fig. 2 plots the duplicate ratio against block frequency (sorted in descending order of frequency). For example, in the VM dataset, the top 5% most frequently occurring data blocks contribute approximately 97% of the duplicate data. This implies that if we maintain a small fingerprint index tracking only these top 5% of frequently occurring data blocks, we can remove approximately 97% of the duplicate data by deduplication and achieve high storage efficiency.
Based on this observation, the main idea of frequency-based deduplication is to decompose the deduplication process. It manages a small fingerprint index inside the enclave to deduplicate the frequently occurring data blocks, and maintains the complete fingerprint index outside the enclave to remove the duplicates of infrequently occurring data blocks. Frequency-based deduplication addresses both performance and security. From the performance perspective, it keeps inside the enclave only a small fingerprint index of the frequently occurring data blocks, which suffices to remove most duplicates; this mitigates EPC usage as well as context-switch overhead, since OCalls to query the complete index outside the enclave are needed only for infrequently occurring data blocks. From the security perspective, because frequently occurring data blocks are the most susceptible to frequency analysis, deduplicating them inside the enclave prevents an attacker in the cloud from easily learning their frequencies, thereby mitigating the information leakage caused by frequency analysis.
Enclave architecture and design roadmap: Fig. 3 depicts the architecture of the enclave in DEBE. At initialization, the enclave configures a set of keys at boot time and establishes a secure data channel with each client. The enclave then records the frequency of each plaintext data block received over a client's data channel. Based on the block frequencies, frequency-based deduplication removes the duplicates of frequently occurring data blocks inside the enclave and interacts with the complete index outside the enclave to remove the duplicates of infrequently occurring data blocks. The enclave further compresses the non-duplicate plaintext data blocks and encrypts the compressed blocks. Finally, the enclave stores the ciphertext data blocks in the storage pool. (Translated labels of Fig. 3: storage pool, cloud, enclave, complete index, frequency-based deduplication, frequency tracking, key management, compression, encryption.)
Key management: the enclave maintains a set of keys for the secure storage of data blocks after deduplication and compression, and for secure communication with clients.
Data key and query key: the enclave maintains two long-term keys that remain valid throughout the enclave's lifetime: 1) a data key, used to encrypt and decrypt the compressed non-duplicate plaintext data blocks in secure storage, and 2) a query key, used to protect plaintext data block information when querying the complete index outside the enclave. After the enclave has been authenticated via remote attestation and launched, it initializes the data key and the query key through SGX mechanisms that provide confidentiality. Notably, protecting the data blocks with only two keys significantly reduces the per-block key management overhead of DaE.
Session keys: each client maintains a data channel with the enclave for secure data communication. Each data channel is protected by a short-term session key that remains valid for a single communication session. The session key for the data channel is established via the Diffie-Hellman key exchange protocol over the control channel with the cloud server. The session key is kept in the enclave for the duration of the client's communication session and released when the session ends (the control channel and data channel are released together).
Per-client master key: the enclave also requires each client to submit a master key with each storage request over the data channel. The master key is used to protect the file allocation tables of the client's files and thereby to guarantee the client's ownership of its files. As with session keys, the enclave retains a client's master key only for the duration of the communication session and destroys it when the session ends, so the memory overhead of master keys is limited.
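For illustration, the key material described above might be organized inside the enclave roughly as in the following C++ sketch; the field names and the use of a session identifier as map key are our assumptions, not part of the claimed system:

#include <array>
#include <cstdint>
#include <unordered_map>

using Key256 = std::array<uint8_t, 32>;

struct EnclaveKeys {
    Key256 data_key;    // long-term: encrypts/decrypts compressed non-duplicate blocks
    Key256 query_key;   // long-term: protects fingerprints/info sent to the outside full index
    // short-term, per connected client (session id -> key), erased when the session ends
    std::unordered_map<uint64_t, Key256> session_keys;
    std::unordered_map<uint64_t, Key256> client_master_keys;  // kept only for the session
};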
Frequency tracking: the enclave needs to track the frequency of plaintext data blocks so as to distinguish frequently occurring data blocks from infrequently occurring ones, enabling frequency-based deduplication. To reduce EPC usage, the enclave uses a CM-Sketch to track the approximate frequency of each data block with a fixed amount of memory and a small error rate. Fig. 4 shows how the enclave implements frequency tracking with the CM-Sketch. The CM-Sketch is a two-dimensional array of r rows, each with w counters. A key design issue is how to map plaintext data blocks to counters with low computational overhead. Since we already compute a cryptographic hash (e.g., SHA-256) as the block fingerprint, we can treat the fingerprint as a random input and map it directly to counters. Specifically, for each plaintext data block M, the enclave divides M's fingerprint into r slices; the i-th slice, taken modulo w, selects one counter in the i-th row, and each selected counter is incremented by one. This differs from the conventional CM-Sketch, which applies an independent hash function per row to map the input to a counter and thus incurs higher computational overhead. To estimate a block's frequency, the enclave takes the minimum of the r mapped counter values. By default we set r to 4 and w to 256K, with 4-byte counters, so the CM-Sketch occupies 4 MiB of EPC in total.
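The following C++ sketch illustrates the fingerprint-sliced CM-Sketch update and estimation described above. It is illustrative only (no error handling, slices assumed at most 8 bytes, names ours), not the claimed implementation:

#include <algorithm>
#include <array>
#include <cstdint>
#include <cstring>
#include <vector>

class CMSketch {
public:
    // Defaults follow the text: r = 4 rows, w = 256K counters per row, 4-byte counters (4 MiB).
    CMSketch(size_t r = 4, size_t w = 256 * 1024)
        : r_(r), w_(w), counters_(r * w, 0) {}

    // Update with a 32-byte fingerprint and return the new estimated frequency.
    uint32_t update(const std::array<uint8_t, 32>& fp) {
        uint32_t estimate = UINT32_MAX;
        // Slice the fingerprint instead of hashing it again; assumes slices of <= 8 bytes.
        const size_t slice_len = std::min<size_t>(8, fp.size() / r_);
        for (size_t i = 0; i < r_; ++i) {
            uint64_t slice = 0;
            std::memcpy(&slice, fp.data() + i * slice_len, slice_len);
            uint32_t& cnt = counters_[i * w_ + (slice % w_)];  // row i, column slice mod w
            ++cnt;
            estimate = std::min(estimate, cnt);                // min over the r counters
        }
        return estimate;
    }

private:
    size_t r_, w_;
    std::vector<uint32_t> counters_;
};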
Frequency-based deduplication: we now present the frequency-based deduplication design, which splits deduplication into two phases according to block frequency and removes all duplicate plaintext data blocks.
First-phase deduplication: the enclave maintains a small fingerprint index, called the top-k index, to deduplicate the k most frequently occurring data blocks. We implement the top-k index by combining a min-heap with a hash table, as shown in Fig. 5. The min-heap separates the top-k frequently occurring data blocks from the infrequently occurring ones, so its root node corresponds to the plaintext data block with the smallest frequency among the current top-k. Each node in the min-heap stores a pointer to a record in the hash table. The hash table, as in conventional deduplication, is used to detect duplicate data blocks: each record maps a block fingerprint to a set of fields: 1) a pointer to the corresponding node in the min-heap (i.e., heap nodes and hash-table records point to each other), 2) the block's estimated frequency, 3) the block's address (the ID of its storage container and its offset inside the container), and 4) the block's compressed size.
For a given plaintext data block, the first-phase deduplication proceeds as follows (a sketch is given after the two cases below). The enclave takes the block's fingerprint as input and obtains the block's current estimated frequency from the CM-Sketch. It first checks the root node of the min-heap: if the current estimated frequency is smaller than the root's frequency (i.e., the block is not a frequently occurring one), the enclave skips the hash-table lookup and proceeds directly to the second-phase deduplication; otherwise (i.e., the block is a frequently occurring one), the enclave further queries the hash table with the fingerprint. The latter case splits into two sub-cases:
1) If the fingerprint is found in the hash table (i.e., the block is a duplicate), the enclave updates the block's frequency in the hash table and adds the block's address and compressed size to the file allocation table. Since the block's frequency has changed, the enclave also adjusts the min-heap, using the hash-table record's pointer to the corresponding heap node.
2) If the fingerprint is not found in the hash table (i.e., the block is a newly frequent one), the enclave creates a new record in the hash table and inserts into the min-heap a new node containing a pointer to that record. If the min-heap already holds k nodes, the enclave removes the current root node and, via the pointer stored in that node, deletes the corresponding hash-table record. Because the newly frequent block may have been stored previously, the enclave still performs the second-phase deduplication on it and updates its address and compressed size according to the result.
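For concreteness, the following C++ sketch captures the first-phase decision logic above. It is a simplification, not the claimed implementation: the min-heap with mutual pointers is replaced by a std::multimap ordered by frequency, the hash table by a std::map, and all names are illustrative:

#include <array>
#include <cstdint>
#include <map>

using Fingerprint = std::array<uint8_t, 32>;
struct BlockAddress { uint64_t container_id; uint32_t offset; uint32_t comp_size; };

struct TopKRecord {
    uint32_t freq;                                            // current estimated frequency
    BlockAddress addr;                                        // where the stored copy lives
    std::multimap<uint32_t, Fingerprint>::iterator heap_pos;  // back-pointer into the "heap"
};

class TopKIndex {
public:
    explicit TopKIndex(size_t k) : k_(k) {}
    enum class Decision { DuplicateInEnclave, GoToSecondPhase };

    // First-phase deduplication for one block, given its CM-Sketch estimate.
    Decision process(const Fingerprint& fp, uint32_t est_freq, BlockAddress* addr_out) {
        // Cheap check against the minimum frequency among the current top-k blocks.
        if (!heap_.empty() && records_.size() >= k_ && est_freq < heap_.begin()->first)
            return Decision::GoToSecondPhase;                 // infrequent block
        auto it = records_.find(fp);
        if (it != records_.end()) {                           // duplicate of a tracked frequent block
            heap_.erase(it->second.heap_pos);                 // refresh its frequency ordering
            it->second.freq = est_freq;
            it->second.heap_pos = heap_.emplace(est_freq, fp);
            *addr_out = it->second.addr;
            return Decision::DuplicateInEnclave;
        }
        // Newly frequent block: admit it, evicting the least frequent entry if full.
        if (records_.size() >= k_) {
            auto victim = heap_.begin();
            records_.erase(victim->second);
            heap_.erase(victim);
        }
        TopKRecord rec{est_freq, BlockAddress{}, heap_.emplace(est_freq, fp)};
        records_.emplace(fp, rec);
        // Its stored copy may already exist, so it still goes through phase two,
        // which afterwards fills in the block's address and compressed size.
        return Decision::GoToSecondPhase;
    }

private:
    size_t k_;
    std::multimap<uint32_t, Fingerprint> heap_;   // minimum frequency at begin()
    std::map<Fingerprint, TopKRecord> records_;   // stands in for the hash table
};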
We can show that the space overhead of the top-k index is relatively small. Assume a 32-byte block fingerprint (a SHA-256 hash), a 12-byte block address (an 8-byte storage container ID and a 4-byte in-container offset), and a 4-byte compressed size. For each of the k most frequently occurring data blocks, the hash-table record additionally stores a 4-byte frequency and a pointer to the corresponding min-heap node; since the min-heap is implemented as an array, that pointer can be represented as a 4-byte integer. Each min-heap node in turn holds an 8-byte pointer to its hash-table record. In total, the top-k index uses 64 bytes per tracked data block (excluding the pointers internal to the hash table; we use the C++ standard library hash table as the implementation). For example, tracking the 512K most frequently occurring data blocks costs 32 MiB of EPC memory.
We further show that the top-k index operations have low time complexity. For each plaintext data block, the top-k index returns the minimum frequency among the current top-k blocks (from the root node) in constant time. For a frequently occurring block, the top-k index additionally probes the hash table (in expected constant time) and updates the min-heap. Because each hash-table record stores a pointer to its min-heap node, the heap can be adjusted starting from that node directly, without searching the whole heap; the time complexity of updating the min-heap is therefore O(log k).
Second-phase deduplication: duplicate data blocks not removed in the first phase, namely infrequently occurring data blocks and newly frequent data blocks, go through the second phase. Because the EPC size is limited, DEBE manages the complete index outside the enclave. We implement the complete index as a hash table in which each record maps an encrypted plaintext block fingerprint to encrypted block information (i.e., the block address and compressed size, both encrypted with the query key). We encrypt the fingerprints and block information because the complete index outside the enclave is no longer protected by the enclave; encrypting them prevents an attacker on the cloud server from deducing the content of plaintext data blocks from this information.
Given a plaintext data block, the second-phase deduplication proceeds as follows. The enclave encrypts the fingerprint of the plaintext data block (a duplicate candidate not removed in the first phase) with the query key, and queries the complete index outside the enclave through an OCall with the encrypted fingerprint. If the encrypted fingerprint is found in the complete index, the OCall returns the encrypted block information, which is decrypted inside the enclave with the query key, and the enclave updates the block's address and compressed size into the file allocation table. Otherwise, if the encrypted fingerprint is new to the complete index, the enclave treats the block as non-duplicate: it assigns the block an address, compresses it, records its compressed size, then encrypts the address and compressed size with the query key and inserts them into the complete index. Since most duplicates are expected to have been removed in the first phase, the context-switch overhead caused by these OCalls is limited.
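The following C++ sketch illustrates the shape of this cross-boundary query. It is an illustration under stated assumptions, not the claimed construction: the deterministic protection of the fingerprint is modeled with HMAC-SHA256 under the query key, the protection of the block information with AES-256-CTR under the same key (a real system would likely use authenticated encryption), the OCall is modeled as a plain function into "untrusted" memory, and all names are ours:

#include <openssl/evp.h>
#include <openssl/hmac.h>
#include <openssl/rand.h>
#include <array>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

using Bytes = std::vector<uint8_t>;
using EncFp = std::array<uint8_t, 32>;
struct BlockInfo { uint64_t container_id; uint32_t offset; uint32_t comp_size; };

// "Untrusted" complete index: encrypted fingerprint -> encrypted block info.
static std::map<EncFp, Bytes> g_full_index;

// Models the OCall: returns true and the encrypted info if the fingerprint exists.
static bool ocall_query_full_index(const EncFp& efp, Bytes* enc_info) {
    auto it = g_full_index.find(efp);
    if (it == g_full_index.end()) return false;
    *enc_info = it->second;
    return true;
}
static void ocall_update_full_index(const EncFp& efp, const Bytes& enc_info) {
    g_full_index[efp] = enc_info;
}

// --- Enclave-side helpers (sketch) ---
static EncFp protect_fingerprint(const uint8_t key[32], const uint8_t fp[32]) {
    EncFp out; unsigned int len = 0;
    HMAC(EVP_sha256(), key, 32, fp, 32, out.data(), &len);  // deterministic keyed transform
    return out;
}

// AES-256-CTR keystream: encryption and decryption are the same operation, no padding.
static Bytes ctr_crypt(const uint8_t key[32], const uint8_t iv[16],
                       const uint8_t* in, size_t n) {
    Bytes out(n); int len = 0;
    EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
    EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), nullptr, key, iv);
    EVP_EncryptUpdate(ctx, out.data(), &len, in, static_cast<int>(n));
    EVP_CIPHER_CTX_free(ctx);
    return out;
}

// Lookup: returns true (duplicate) and decrypts the block info inside the enclave.
bool second_phase_lookup(const uint8_t query_key[32], const uint8_t fp[32], BlockInfo* info) {
    Bytes enc_info;
    if (!ocall_query_full_index(protect_fingerprint(query_key, fp), &enc_info)) return false;
    Bytes pt = ctr_crypt(query_key, enc_info.data(),               // layout: iv(16) || ciphertext
                         enc_info.data() + 16, enc_info.size() - 16);
    std::memcpy(info, pt.data(), sizeof(BlockInfo));
    return true;
}

// Insert: called for a non-duplicate block after the enclave has assigned its address
// and compressed it; the protected record is pushed back to the outside complete index.
void second_phase_insert(const uint8_t query_key[32], const uint8_t fp[32], const BlockInfo& info) {
    uint8_t iv[16];
    RAND_bytes(iv, sizeof iv);
    Bytes ct = ctr_crypt(query_key, iv,
                         reinterpret_cast<const uint8_t*>(&info), sizeof info);
    Bytes enc(iv, iv + sizeof iv);
    enc.insert(enc.end(), ct.begin(), ct.end());
    ocall_update_full_index(protect_fingerprint(query_key, fp), enc);
}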
Storage management, container storage: DEBE organizes data blocks into fixed-size containers to reduce I/O costs. Specifically, the enclave compresses the non-duplicate plaintext data blocks after deduplication and encrypts the compressed blocks into ciphertext data blocks. It then writes the ciphertext data blocks into a container buffer inside the enclave; when the buffer is full, its content is made immutable and released to the cloud server for persistent storage. In addition, the enclave creates a file allocation table for each newly uploaded file, where each entry records a block's address and compressed size. When updating the file allocation table, the enclave does not need to compress duplicate data blocks again to obtain their compressed sizes, since the compressed sizes are stored in the top-k index and the complete index. To guarantee file ownership, the enclave encrypts the file allocation table with the client's master key. Since the enclave treats a container (holding multiple ciphertext data blocks) as the basic I/O unit and the block sizes are stored in the file allocation table (protected by each user's master key), DEBE keeps compression secure and avoids revealing the lengths of compressed data blocks to the cloud server.
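A minimal C++ sketch of such an in-enclave container buffer follows; the flush callback, names, and the treatment of partially filled containers are our assumptions (the 4 MiB default matches the container size quoted in the implementation section below):

#include <cstdint>
#include <functional>
#include <vector>

struct ContainerLocation { uint64_t container_id; uint32_t offset; };

class ContainerBuffer {
public:
    using FlushFn = std::function<void(uint64_t id, const std::vector<uint8_t>&)>;
    explicit ContainerBuffer(FlushFn flush, size_t capacity = 4u << 20)  // 4 MiB container
        : flush_(std::move(flush)), capacity_(capacity) { buf_.reserve(capacity_); }

    // Append one ciphertext block; returns where it will live in the storage pool.
    ContainerLocation append(const std::vector<uint8_t>& cipher_block) {
        if (buf_.size() + cipher_block.size() > capacity_) seal_and_flush();
        ContainerLocation loc{current_id_, static_cast<uint32_t>(buf_.size())};
        buf_.insert(buf_.end(), cipher_block.begin(), cipher_block.end());
        return loc;
    }

    // Force out a partially filled container (e.g., at the end of an upload).
    void seal_and_flush() {
        if (buf_.empty()) return;
        flush_(current_id_, buf_);   // released to the cloud server for persistence
        buf_.clear();
        ++current_id_;
    }

private:
    FlushFn flush_;
    size_t capacity_;
    uint64_t current_id_ = 0;
    std::vector<uint8_t> buf_;
};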
An alternative design would treat the container as the basic compression unit (rather than compressing at the data block level): the enclave would first write non-duplicate plaintext data blocks into a container, compress the whole container, and encrypt the compressed container with the data key. However, this design adds data-recovery overhead, because the enclave must decompress and decrypt an entire container even when it needs only a single data block from it. Designing efficient container-level compression is left as future work.
Download: to download a file, the client first sends the requested file name and its master key to the enclave over the secure data channel. The enclave retrieves the corresponding file allocation table by file name and decrypts it with the client-provided master key. It then parses the decrypted file allocation table to obtain the block addresses and compressed sizes. To restore the data blocks, the enclave exposes to the cloud server only the IDs of the storage containers holding the required blocks and issues the corresponding I/O operations through OCalls. Once the cloud server loads a container into main memory, the enclave reads each ciphertext data block at its in-container offset and decrypts it with the data key. Finally, the enclave decompresses the plaintext data blocks and sends them to the client over the data channel.
When recovering data, the enclave exposes only container IDs to the cloud server, not the specific addresses of ciphertext data blocks. This prevents an attacker in the cloud server from launching a frequency attack by counting accesses to individual ciphertext data blocks.
Security discussion: We discuss the security of DEBE with respect to the threat model we consider. We mainly consider two cases.
Case one: a snapshot attacker has one-time access to the data content in unprotected memory and the storage pool. DEBE provides semantic security for the plaintext data blocks generated by clients: only ciphertext data blocks are stored in the cloud server, and data is protected end to end. Specifically, it establishes a secure data channel and encrypts all plaintext data blocks exchanged between the client and the enclave with the session key. It performs deduplication inside the enclave (unobservable by the cloud server) and encrypts non-duplicate plaintext data blocks into ciphertext data blocks with the data key before storing them. Overall, DEBE uses conventional symmetric encryption for both data transmission and data storage, and thus achieves semantic security.
Case two: a persistent attacker listens to OCalls during deduplication. DEBE encrypts the plaintext data block fingerprint and the data block information with the query key inside the enclave before passing them as OCall inputs to query the full index outside the enclave. Thus, even if an attacker can eavesdrop on an OCall, it cannot infer the original inputs from the OCall.
On the other hand, one potential information leak is that a persistent attacker (residing in the cloud server for a long time) can learn the frequency information of data blocks during deduplication, because the enclave encrypts the fingerprints of identical data blocks into the same encrypted fingerprint when querying the full index. Specifically, an attacker can track the frequency distribution of encrypted fingerprints by listening to OCalls and launch frequency analysis to infer the original plaintext data blocks. However, DEBE limits such leakage to infrequent data blocks. Our evaluation shows that DEBE reduces information leakage more effectively than the state-of-the-art method TED (which trades storage efficiency for security).
Implementation: We implemented a DEBE prototype in C++ on Linux, based on Intel SGX SDK Linux 2.7. It uses OpenSSL 1.1.1 [39] and Intel SGX SSL [40] for cryptographic operations. Our current prototype contains 17.5K lines of code.
Each client uses FastCDC [38] for variable-size chunking, with the minimum, average, and maximum data block sizes set to 4KiB, 8KiB, and 16KiB, respectively. The container size is 4MiB. We implement the Diffie-Hellman key exchange protocol over the NIST P-256 elliptic curve to manage the session keys of the data channels between clients and the enclave. The enclave computes the fingerprint of each plaintext data block with SHA-256 and encrypts plaintext data blocks, as well as data block fingerprints when querying the full index, with AES-256. Both SHA-256 and AES-256 are hardware-accelerated via Intel's instruction set extensions. We also implement LZ4 for lossless compression of data blocks after deduplication.
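For instance, a minimal chunk-fingerprinting routine along the lines described above could use the OpenSSL 1.1.1 one-shot SHA-256 API (inside the enclave, the Intel SGX SSL build of the same API would be used); this is a sketch for illustration, not the prototype's actual code.

```cpp
// Minimal sketch: compute the SHA-256 fingerprint of a plaintext chunk.
#include <openssl/sha.h>
#include <array>
#include <cstdint>
#include <vector>

using Fingerprint = std::array<uint8_t, SHA256_DIGEST_LENGTH>;

Fingerprint ComputeFingerprint(const std::vector<uint8_t>& chunk) {
    Fingerprint fp{};
    // One-shot SHA-256 over the chunk contents (32-byte digest).
    SHA256(chunk.data(), chunk.size(), fp.data());
    return fp;
}
```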
To mitigate context-switch overhead, DEBE processes data blocks in batches (128 data blocks per batch by default). In addition, to improve download performance, the cloud server maintains an in-memory LRU cache (256MiB by default) that holds recently accessed containers. For each container access request issued by the enclave, the cloud server first checks the cache and reads the container from local storage only if it is not cached.
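The server-side container cache can be pictured with the following sketch of a byte-bounded LRU; the class name, capacity handling, and interface are our assumptions rather than DEBE's actual implementation.

```cpp
// Illustrative server-side LRU cache for recently accessed containers.
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

class ContainerCache {
public:
    explicit ContainerCache(size_t capacityBytes) : capacity_(capacityBytes) {}

    // Returns the cached container, or nullptr if it must be read from disk.
    const std::vector<uint8_t>* Get(uint64_t containerId) {
        auto it = index_.find(containerId);
        if (it == index_.end()) return nullptr;
        lru_.splice(lru_.begin(), lru_, it->second);   // move to front (most recent)
        return &it->second->second;
    }

    void Put(uint64_t containerId, std::vector<uint8_t> data) {
        if (index_.count(containerId)) return;         // already cached
        lru_.emplace_front(containerId, std::move(data));
        index_[containerId] = lru_.begin();
        used_ += lru_.front().second.size();
        while (used_ > capacity_) {                    // evict least recently used
            used_ -= lru_.back().second.size();
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
    }

private:
    size_t capacity_, used_ = 0;
    std::list<std::pair<uint64_t, std::vector<uint8_t>>> lru_;
    std::unordered_map<uint64_t,
        std::list<std::pair<uint64_t, std::vector<uint8_t>>>::iterator> index_;
};
```

With a 256MiB capacity and 4MiB containers, such a cache holds roughly 64 recently used containers.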
Experimental setup: We deploy DEBE on a local cluster of multiple machines connected via 10GbE. Each machine has a quad-core 3.4GHz Intel Core i5-7500 CPU and 32GiB RAM, running Ubuntu 16.04. We deploy one or more clients, one key server (used only by DaE), and one cloud storage server on different machines. The cloud storage server is equipped with a Toshiba DT01ACA 1TiB 7200RPM SATA hard disk. By default, DEBE sets the k of the in-enclave top-k index to 512K.
We use both synthetic and real-world datasets to evaluate DEBE. We summarize our evaluation results as follows.
1) Compared with the state-of-the-art DaE methods, DEBE improves the upload speed of non-duplicate data and duplicate data by up to 9.83x and 13.44x, respectively (experiment one), while frequency-based deduplication accounts for only 5.9-12.0% of the total upload time (experiment two). DEBE maintains high performance when multiple clients upload and download simultaneously (experiment three) and under various synthetic workloads (experiment four).
2) For real-world workloads, DEBE is 1.16-2.51x faster than state-of-the-art deduplication alternatives (experiment five) and maintains high performance in long-term upload and download scenarios (experiment six). It also achieves higher storage efficiency (experiment seven) and stronger security against frequency analysis (experiment eight).
Datasets, synthetic datasets: We use two synthetic datasets for evaluation. The first dataset, SYN-Unique, contains non-duplicate, compressible data blocks. Specifically, we generate a set of 2GiB compressible files for SYN-Unique with an LZ data generator, which synthesizes compressible data following the algorithm of SDGen. The LZ data generator takes two parameters as input: 1) the compression ratio, which specifies the compressibility of the generated data, and 2) a random seed for generating the data. We set the compression ratio to 2 to simulate real-world backup workloads and vary the random seed to generate different synthetic files. We chunk each synthetic file, ensuring that the resulting data blocks are globally unique across all files. We use this dataset for stress tests with non-duplicate data (experiments one, two, and three).
The second dataset, SYN-Freq, contains data blocks that follow a target frequency distribution. To this end, we build a synthetic file generator in which the generated data blocks follow a Zipf distribution. The generator takes three parameters as input: 1) the number of original data blocks, 2) the deduplication ratio (i.e., the ratio between the original data size and the non-duplicate data size), and 3) the Zipfian constant (a larger constant means a more skewed frequency distribution). To generate a synthetic file, the generator prepares a set of non-duplicate fingerprints based on the expected number of non-duplicate data blocks (i.e., the number of original data blocks divided by the deduplication ratio). It assigns each fingerprint a compression ratio drawn from a normal distribution with mean 2 and variance 0.25. To generate each original data block, the generator draws a fingerprint from the prepared set according to the target Zipf distribution and constructs its content with the LZ data generator, using the compression ratio and the fingerprint (as the random seed) as inputs. Finally, we generate a set of synthetic files for SYN-Freq, where each file contains 13,107,200 original data blocks of 8KiB (i.e., 100GiB in total) with a deduplication ratio of 5. The number of non-duplicate data blocks is large enough that the top-k index can track only a fraction of them. We vary the Zipfian constant to study the effect of the skewness of the frequency distribution on performance (experiment four), as illustrated by the sampling sketch below.
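A minimal sketch of how Zipf-distributed fingerprint indices could be drawn when building SYN-Freq-like files is given below; the class and parameter names are illustrative assumptions, not the actual generator code.

```cpp
// Sketch: draw chunk-fingerprint indices following a Zipf law.
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Draws indices in [0, numUnique) with P(i) proportional to 1 / (i + 1)^s,
// where s is the Zipfian constant (larger s => more skewed distribution).
class ZipfSampler {
public:
    ZipfSampler(size_t numUnique, double s, uint64_t seed) : rng_(seed) {
        std::vector<double> weights(numUnique);
        for (size_t i = 0; i < numUnique; ++i)
            weights[i] = 1.0 / std::pow(static_cast<double>(i + 1), s);
        dist_ = std::discrete_distribution<size_t>(weights.begin(), weights.end());
    }
    size_t Next() { return dist_(rng_); }   // index of the fingerprint to reuse

private:
    std::mt19937_64 rng_;
    std::discrete_distribution<size_t> dist_;
};
```

Each drawn index selects one of the prepared non-duplicate fingerprints; the block content is then synthesized by the LZ data generator using the assigned compression ratio and the fingerprint as the random seed.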
Real datasets: We use five real-world backup workloads to evaluate DEBE's performance (experiments five and six), storage efficiency (experiment seven), and security (experiment eight):
1) DOCKER: dock mirror image of Couchbase from dock Hub (from v4.1.0 to v7.0.0);
2) LINUX: a snapshot of Linux source code (from a stable version between v2.6.131 and v 5.9);
3) FSL: a system master catalog file snapshot, wherein we selected 42 master catalog file snapshots from nine users in 2013;
4) MS: windows file system snapshots, where we selected 30 snapshots of approximately 100GiB size;
5) VM: virtual machine snapshot.
Table one shows the statistics of the five real-world datasets. Since FSL, MS, and VM contain only data block fingerprints, we use the fingerprints to generate compressible data blocks in the same way as for SYN-Freq.
Table one: statistical information of real data sets
Evaluation on the synthetic datasets: To examine the maximum achievable performance without the impact of disk I/O overhead, we load the synthetic files into each client's memory before each test and let the cloud server keep all deduplicated data in memory. We report the average of five runs together with 95% confidence intervals based on the Student's t-distribution (except for line graphs).
Experiment one (overall performance): We evaluate the upload (download) performance of the whole system. We let a single client upload the same 2GiB file twice in succession to study the maximum achievable performance of uploading non-duplicate data and duplicate data, respectively. We then let the client download the same file. We measure the upload (download) speed of each operation.
We compare DEBE with three DaE methods: 1) DupLESS, which implements OPRF-based server-aided key management; 2) TED, which generates per-block keys at a key server using lightweight hashing; and 3) CE, the conventional convergent encryption scheme. To study the security overhead of DEBE, we also include plaintext deduplication (Plain), in which the client uploads plaintext data blocks to the cloud server for deduplication and compression without any security protection. Unlike DEBE and Plain, the DaE schemes (i.e., DupLESS, TED, and CE) are not compatible with compression. For a fair comparison, we implement all comparison schemes ourselves in C++.
Fig. 6 shows the upload speeds. DEBE outperforms all DaE methods. When uploading non-duplicate data, DEBE achieves 9.83x, 1.28x, and 1.22x improvements over DupLESS, TED, and CE, respectively, by avoiding per-block key generation based on data content. Even though DEBE applies compression, the compression overhead is masked by the performance gain of avoiding DaE-style key generation. When uploading duplicate data, DEBE becomes even more efficient, improving performance by 13.44x, 1.71x, and 1.65x over DupLESS, TED, and CE, respectively, mainly because it avoids encrypting and compressing duplicate data blocks. Compared with Plain, DEBE incurs only 13.6% and 7.8% performance overhead when uploading non-duplicate and duplicate data, respectively.
Fig. 6 also shows the download speeds. All DaE methods follow the same download mode: the client retrieves the ciphertext data blocks and their corresponding keys from the cloud server, then decrypts the ciphertext data blocks and reconstructs the original file. Compared with the DaE methods, DEBE has a 12.3% lower download speed because it reads data blocks into the enclave through OCalls for decryption and decompression. Furthermore, compared with Plain, the download speeds of DEBE and DaE drop by 38.1% and 17.4%, respectively, because they must decrypt the data blocks.
Experiment two (breakdown of the upload operation): We break down the upload operation. We consider the same scenario as experiment one (i.e., the client uploads the same 2GiB file from SYN-Unique twice in succession) and measure the computation time of the client and the enclave in the different upload steps: 1) chunking, the client chunks the input file into plaintext data blocks; 2) secure transmission, the enclave exchanges a session key with the client and decrypts the received ciphertext with the session key; 3) fingerprint computation, the enclave computes the fingerprint of each plaintext data block; 4) frequency tracking, the enclave estimates the frequency of each plaintext data block with CM-Sketch; 5) first-stage deduplication, the enclave removes duplicate plaintext data blocks via the top-k index; 6) second-stage deduplication, the enclave queries the full index outside the enclave through OCalls to remove the remaining duplicate data blocks; 7) compression, the enclave compresses non-duplicate plaintext data blocks; 8) encryption, the enclave encrypts the compressed plaintext data blocks with the data key.
Table two shows the results (measured as the processing time per 1MiB of uploaded data). Fingerprint computation and compression are the most time-consuming steps in the first upload (i.e., uploading non-duplicate data blocks), because they perform heavy computation on all data blocks. In contrast, frequency-based deduplication (including frequency tracking, first-stage deduplication, and second-stage deduplication) accounts for only 12.0% of the total time. Since the cloud server stores no data before the first upload, every data block is treated as non-duplicate and goes through both the first-stage and second-stage deduplication. In the second upload (i.e., uploading duplicate data blocks), all duplicate data blocks are removed in the first-stage deduplication, so second-stage deduplication is not needed. In this case, frequency-based deduplication accounts for only 5.9% of the total upload time.
Table two: breakdown of the upload computation time (processing time per 1MiB of uploaded data)
Experiment three (multi-client upload and download): We evaluate the performance when multiple clients issue upload/download requests simultaneously. Besides the cloud server, we deploy 10 machines, each running two client instances, to simulate up to 20 clients uploading/downloading concurrently. Each client uploads a 2GiB file from SYN-Unique to the cloud server and then downloads the same 2GiB file. We measure the aggregate upload (download) speed as the ratio of the total uploaded (downloaded) data size to the total time for all clients to complete the uploads (downloads).
Fig. 7 shows the performance versus the number of clients. The aggregate upload speed first increases with the number of clients and reaches 812.0MiB/s with 10 clients. It then drops to 752.8MiB/s with 20 clients due to resource contention in the enclave. The aggregate download speed follows a similar trend, increasing to 733.0MiB/s and then decreasing to 679.7MiB/s.
Experiment four (impact of the data block frequency distribution): We evaluate the performance of DEBE under workloads with different data block frequency distributions. We configure a client to upload the data blocks of each SYN-Freq file (no chunking is involved) and measure the computation speed of the enclave (i.e., covering all steps in table two except chunking).
Fig. 8 shows the results for different values of k in the top-k index and different Zipfian constants. A larger k leads to lower performance for all Zipfian constants, because SGX incurs significant paging overhead once the enclave's memory usage exceeds 64MiB. For example, when the Zipfian constant is 1.05, the computation speeds for k=512K and k=1M are 317.5MiB/s and 158.6MiB/s, respectively. Furthermore, the enclave's computation speed increases as the data block frequency distribution becomes more skewed (i.e., a larger Zipfian constant), because the most frequent data blocks contribute more duplicates, which mitigates the OCall overhead of querying the full index.
Evaluation on the real datasets, experiment five (performance of different deduplication approaches): Since frequency-based deduplication is critical to the DEBE design, we compare it with other designs. We mainly consider two popular memory-efficient deduplication approaches, namely similarity-based deduplication and locality-based deduplication. Both approaches compute a feature for each segment containing multiple data blocks and keep a small feature index inside the enclave. They then perform deduplication by loading part of the out-of-enclave full index into the enclave according to feature matches. Similarity-based deduplication uses the minimum fingerprint of the data blocks in each segment as the feature, while locality-based deduplication samples the data block fingerprints in each segment as features. Following previous work, we set the segment size to 10MiB and use a sampling rate of 1/64 for locality-based deduplication. While both approaches were originally designed to mitigate disk I/O overhead in plaintext deduplication, our insight is that they can also reduce EPC memory usage; however, they can only support approximately exact deduplication.
In addition to the above approximately exact deduplication approaches, we include two simple exact deduplication baselines. In-Enclave manages the full index inside the enclave; when the full index grows beyond the EPC, page swapping is triggered to evict unused EPC pages to normal memory. Out-Enclave manages the full index in unprotected memory and checks for duplicate data blocks by querying the full index through OCalls. For a fair comparison, all baselines compress non-duplicate data blocks. We upload the snapshots of each real backup dataset in order of creation time (see 6.1). We measure the computation speed of the enclave as in experiment four.
Fig. 9 shows the results. DEBE generally outperforms all other approaches. For example, on FSL, DEBE is 1.16x, 1.22x, 1.27x, and 2.51x faster than the similarity-based, locality-based, Out-Enclave, and In-Enclave approaches, respectively. The reason is that the similarity-based and locality-based approaches are approximately exact deduplication, which additionally compresses and encrypts some duplicate data blocks and thus incurs extra computation overhead. In addition, DEBE's first-stage deduplication filters out much of the overhead of querying the out-of-enclave full index through OCalls. Although In-Enclave is faster than DEBE for small workloads (e.g., the first few snapshots in DOCKER and LINUX), its performance drops sharply for later snapshots due to its expensive paging overhead. DEBE keeps only lightweight data structures (CM-Sketch and the top-k index) inside the enclave to mitigate paging overhead.
Experiment six (upload and download of real data): Unlike experiment one, we evaluate the upload and download performance of DEBE on the real datasets, including the disk I/O of the cloud server. We upload all snapshots of each dataset in order and finally download them. Since FSL, VM, and MS contain only (generated) compressible data blocks rather than whole files, the clients upload the data blocks directly without chunking.
Fig. 10 shows the upload and download speeds for each snapshot. The upload speed gradually increases because later snapshots contain more duplicate data blocks, which DEBE avoids compressing and encrypting. For example, on FSL, DEBE uploads the first snapshot at 225.6MiB/s and the last snapshot at 263.4MiB/s. The download speed decreases over snapshots due to the increased I/O overhead caused by chunk fragmentation (i.e., the data blocks of later snapshots become more scattered after deduplication). For example, the download speed of the first snapshot on FSL is 131.4MiB/s and drops to 95.1MiB/s for the last snapshot. Fragmentation can be mitigated by existing techniques; we leave this as future work.
Experiment seven (storage efficiency): We compare DEBE with the approximately exact deduplication approaches of experiment five (i.e., similarity-based and locality-based deduplication) in terms of storage efficiency. For each approach we consider 1) deduplication only, without compression (D), and 2) deduplication and compression (DC), which deduplicates and then compresses each non-duplicate data block. We measure the data reduction ratio as the ratio of the original data size to the stored data size after deduplication (and compression). We do not count metadata, since it is much smaller than the original file data.
Fig. 11 compares the data reduction ratios after each snapshot is stored under the different settings. As expected, DEBE outperforms approximately exact deduplication. On FSL, after the last snapshot is stored, DEBE without compression achieves a data reduction ratio of 8.24x (38.0% and 10.1% higher than the similarity-based and locality-based approaches, respectively). Compression brings additional storage savings on top of deduplication, especially for LINUX, which contains much byte-level redundancy. After all LINUX snapshots are stored, the data reduction ratio of DEBE with compression is 6.32x, 145.9% higher than without compression.
Experiment eight (security against frequency analysis): We study the security of DEBE against frequency analysis. We compare DEBE instances with different k (i.e., k=128K, 256K, and 512K) against the DaE methods CE and TED from experiment one. TED trades storage efficiency for security; we configure it to sacrifice 15% storage efficiency (i.e., store 15% more data to enhance security). Like TED, we use the Kullback-Leibler distance (KLD), i.e., the relative entropy between the frequency distribution of ciphertext data blocks and the uniform distribution, to quantify information leakage; a smaller KLD means less information leakage. Since DEBE can only leak frequency information when querying the full index outside the enclave, we compute its frequency distribution from the encrypted fingerprints observed in OCalls.
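For reference, the KLD metric as described here can be computed along the following lines; the input map (encrypted fingerprint to occurrence count) and the choice of base-2 logarithm are our assumptions about the measurement, not a statement of the paper's exact procedure.

```cpp
// Sketch: relative entropy between the observed ciphertext-chunk frequency
// distribution and the uniform distribution over the same support.
#include <cmath>
#include <cstdint>
#include <string>
#include <unordered_map>

double ComputeKLD(const std::unordered_map<std::string, uint64_t>& freq) {
    uint64_t total = 0;
    for (const auto& kv : freq) total += kv.second;
    const double uniform = 1.0 / static_cast<double>(freq.size());

    double kld = 0.0;
    for (const auto& kv : freq) {
        const double p = static_cast<double>(kv.second) / static_cast<double>(total);
        kld += p * std::log2(p / uniform);   // smaller KLD => less leakage
    }
    return kld;
}
```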
Fig. 12 shows the KLD aggregated over all snapshots of each dataset. All DEBE instances achieve a smaller KLD than CE and TED, because DEBE fully protects high-frequency data blocks by performing first-stage deduplication inside the enclave. For example, on FSL, when k is 128K, the KLD of DEBE is 0.69 (72.6% and 35.5% lower than CE and TED, respectively). When we increase k to 512K, the KLD of DEBE drops further to 0.62, as more high-frequency data blocks are deduplicated inside the enclave. In addition, the KLDs on DOCKER and LINUX are smaller than on VM, because the frequency distributions of plaintext data blocks in these two datasets are inherently more uniform. Nonetheless, DEBE reduces the KLD of TED by 42.6% and 86.8% on DOCKER and LINUX, respectively.
DaE method: various methods enable secure data re-deletion through DaE. Some methods, in addition to those described in 2.1, are designed from a security perspective. Random MLEs [36] and imles [37] apply non-deterministic encryption to prevent frequency leakage, but they use encryption primitives that are costly (e.g., non-interactive zero knowledge proof [36], fully homomorphic encryption [37 ]). Liu et al [35] suggested that keys could be generated by a distributed key sharing protocol without relying on a dedicated key server, but it introduced the performance overhead of interactions between different clients. TED 10 alleviates the problem of frequency leakage by setting a sacrifice in memory efficiency. In contrast, the DEBE implementation DbE solves both key management overhead and security issues.
Combining SGX with secure deduplication: SGX has been used for secure deduplication. Dang et al. [34] use SGX as a trusted agent to save bandwidth in secure deduplication. SPEED [33] deduplicates repeated computation tasks within enclaves to improve resource utilization. You et al. [32] use SGX to verify ownership in deduplication for secure deduplicated storage. SeGShare [31] deploys an enclave on the server side for file-based secure deduplication, but it does not consider data block-based deduplication and does not address the fingerprint indexing problem. S2Dedup [30] uses a server-side enclave to eliminate the trusted key server for key generation and secures deduplication by re-encrypting data blocks within the enclave, while deduplication itself still happens outside the enclave. In contrast, DEBE performs deduplication directly within the enclave to protect plaintext data blocks. SGXDedup [29] leverages SGX to improve the performance of client-side deduplication under DaE. All of the above SGX-based deduplication approaches remain DaE-based.
DEBE realizes a previously unexplored design paradigm, deduplication-before-encryption (DbE), for building a secure deduplicated storage system. It leverages the characteristics of Intel SGX and uses frequency-based deduplication to maintain an in-enclave fingerprint index for high-frequency data blocks. We show that DbE outperforms the conventional encryption-before-deduplication (DaE) approaches in system performance, storage efficiency, and security.
[1]Data Age 2025. https://www.seagate.com/ourstory/data-age-2025/.
[2]Data privacy will be the most important issue in the next decade. https://www.forbes.com/sites/marymeehan/2019/11/26/data-privacy-willbe-the-most-important-issue-in-the-nextdecade/.
[3]A. Adya, W. J. Bolosky, M. Castro, G. Cermak, R. Chaiken, J. R. Douceur, J. Howell, J. R. Lorch, M. Theimer, and R. P. Wattenhofer. FARSITE: Federated, available, and reliable storage for an incompletely trusted environment. In Proc. of USENIX OSDI, 2002.
[4]M. Bellare, S. Keelveedhi, and T. Ristenpart. Message-locked encryption and secure deduplication. In Proc. of EuroCrypto, 2013.
[5]L. P. Cox, C. D. Murray, and B. D. Noble. Pastiche: Making backup cheap and easy. In Proc. of USENIX OSDI, 2002.
[6]J. R. Douceur, A. Adya, W. J. Bolosky, P. Simon, and M. Theimer. Reclaiming space from duplicate files in a serverless distributed file system. In Proc. of IEEE ICDCS, 2002.
[7]P. Shah and W. So. Lamassu: Storage-efficient host-side encryption. In Proc. of USENIX ATC, 2015.
[8]J. Li, P. P. Lee, Y. Ren, and X. Zhang. Metadedup: Deduplicating metadata in encrypted deduplication via indirection. In Proc. of IEEE MSST, 2019.
[9]J. Li, P. P. Lee, C. Tan, C. Qin, and X. Zhang. Information leakage in encrypted deduplication via frequency analysis: Attacks and defenses. ACM Trans. on Storage, 16(1):1–30, 2020.
[10]J. Li, Z. Yang, Y. Ren, P. P. Lee, and X. Zhang. Balancing storage efficiency and data confidentiality with tunable encrypted deduplication. In Proc. of ACM EuroSys, 2020.
[11]I. Anati, S. Gueron, S. Johnson, and V. Scarlata. Innovative technology for cpu based attestation and sealing. In Proc. of ACM HASP, 2013.
[12]M. Hoekstra, R. Lal, P. Pappachan, V. Phegade, and J. Del Cuvillo. Using innovative instructions to create trustworthy software solutions. In Proc. of ACM HASP, 2013.
[13]Intel(R) Software Guard Extensions. https://software.intel.com/content/www/us/en/develop/documentation/sgx-developer-guide/top.html.
[14]D. Harnik, E. Tsfadia, D. Chen, and R. Kat. Securing the storage data path with SGX enclaves. https://arxiv.org/abs/1806.10883, 2018.
[15]M. Bellare, S. Keelveedhi, and T. Ristenpart. DupLESS: Server-aided encryption for deduplicated storage. In Proc. of USENIX Security, 2013.
[16]A. Duggal, F. Jenkins, P. Shilane, R. Chinthekindi, R. Shah, and M. Kamat. Data domain cloud tier: Backup here, backup there, deduplicated everywhere! In Proc. of USENIX ATC, 2019.
[17]A. El-Shimi, R. Kalach, A. Kumar, A. Ottean, J. Li, and S. Sengupta. Primary data deduplication-large scale study and system design. In Proc. of USENIX ATC, 2012.
[18]D. T. Meyer and W. J. Bolosky. A study of practical deduplication. In Proc. of USENIX FAST, 2012.
[19]K. Srinivasan, T. Bisson, G. R. Goodson, and K. Voruganti. iDedup: Latency-aware, inline data deduplication for primary storage. In Proc. of USENIX FAST, 2012.
[20] M. Naor and O. Reingold. Number-theoretic constructions of efficient pseudo-random functions. Journal of the ACM, 51(2):231–262, 2004.
[21]G. Wallace, F. Douglis, H. Qian, P. Shilane, S. Smaldone, M. Chamness, and W. Hsu. Characteristics of backup workloads in production systems. In Proc. of USENIX FAST, 2012.
[22]D. Chen, M. Factor, D. Harnik, R. Kat, and E. Tsfadia. Length preserving compression: Marrying encryption with compression. In Proc. of ACM SYSTOR, 2021.
[23]D. Harnik, B. Pinkas, and A. Shulman-Peleg. Side channels in cloud services: Deduplication in cloud storage. IEEE Security & Privacy, 8(6):40–47, 2010.
[24]M. Mulazzani, S. Schrittwieser, M. Leithner, M. Huber, and E. Weippl. Dark clouds on the horizon: Using cloud storage as attack vector and online slack space. In Proc. of USENIX Security, 2011.
[25]S. Pinto and N. Santos. Demystifying ARM TrustZone: A comprehensive survey. ACM Computing Surveys, 51(6):1–36, 2019.
[26]AMD Secure Encrypted Virtualization (SEV). https://developer.amd.com/sev/.
[27]S. Arnautov, B. Trach, F. Gregor, T. Knauth, A. Martin, C. Priebe, J. Lind, D. Muthukumaran, D. O’keeffe, M. L. Stillwell, et al. SCONE: Secure Linux containers with Intel SGX. In Proc. of USENIX OSDI, 2016.
[28]E. Feng, X. Lu, D. Du, B. Yang, X. Jiang, Y. Xia, B. Zang, and H. Chen. Scalable memory protection in the PENGLAI enclave. In Proc. of USENIX OSDI, 2021.
[29]Y. Ren, J. Li, Z. Yang, P. P. Lee, and X. Zhang. Accelerating encrypted deduplication via SGX. In Proc. of USENIX ATC, 2021.
[30]M. Miranda, T. Esteves, B. Portela, and J. Paulo. S2Dedup: SGX-enabled secure deduplication. In Proc. of ACM SYSTOR, 2021.
[31]B. Fuhry, L. Hirschoff, S. Koesnadi, and F. Kerschbaum. SeGShare: Secure group file sharing in the cloud using enclaves. In Proc. of IEEE/IFIP DSN, 2020.
[32]W. You and B. Chen. Proofs of ownership on encrypted cloud data via Intel SGX. In Proc. of ACNS, 2020.
[33]H. Cui, H. Duan, Z. Qin, C. Wang, and Y. Zhou. SPEED: Accelerating enclave applications via secure deduplication. In Proc. of IEEE ICDCS, 2019.
[34]H. Dang and E.-C. Chang. Privacy-preserving data deduplication on trusted processors. In Proc. of IEEE CLOUD, 2017.
[35]J. Liu, N. Asokan, and B. Pinkas. Secure deduplication of encrypted data without additional independent servers. In Proc. of ACM CCS, 2015.
[36]M. Abadi, D. Boneh, I. Mironov, A. Raghunathan, and G. Segev. Message-locked encryption for lock dependent messages. In Proc. of CRYPTO, 2013.
[37]M. Bellare and S. Keelveedhi. Interactive message-locked encryption and secure deduplication. In Proc. of PKC, 2015.
[38]W. Xia, Y. Zhou, H. Jiang, D. Feng, Y. Hua, Y. Hu, Q. Liu, and Y. Zhang. FastCDC: A fast and efficient content-defined chunking approach for data deduplication. In Proc. of USENIX ATC, 2016.
[39]Cryptography and SSL/TLS Toolkit. www.openssl.org/.
[40]Intel(R) Software Guard Extensions SSL. https://github.com/intel/intel-sgx-ssl.
For large-scale data management, cloud storage must provide both data confidentiality and high storage efficiency. The conventional approach follows the encryption-before-deduplication design paradigm, which first encrypts data and then deduplicates the ciphertext. This paradigm has drawbacks in performance, storage efficiency, and security. In this work, we study a previously unexplored design paradigm, deduplication-before-encryption, which first deduplicates data and then encrypts only the non-duplicate data. Although deduplication-before-encryption effectively avoids the performance and storage overhead of managing duplicate data, its deduplication process is no longer protected by encryption. To address this problem, we design DEBE, a secure deduplication-before-encryption storage system based on trusted execution protection, which uses Intel SGX to protect the deduplication process. DEBE adopts frequency-based deduplication: it first deduplicates frequently occurring data inside the resource-limited enclave and then deduplicates the remaining data against the full index outside the enclave. Experimental results show that DEBE outperforms existing encryption-before-deduplication approaches in performance, storage efficiency, and security.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (7)

1. A secure deduplication-before-encryption storage system based on trusted execution protection, comprising a client, a data channel, a control channel, and a cloud server, wherein the client is connected to the cloud server through the data channel and the control channel and is configured to upload a user's plaintext data blocks to the cloud-side enclave through the data channel; the cloud server is configured to maintain a global fingerprint index that tracks the data blocks stored by all clients, remove duplicate data blocks within the enclave, encrypt non-duplicate plaintext data blocks, and finally store ciphertext data blocks in a storage pool; the data channel is used to transmit the plaintext data blocks initiated by the client, and the control channel is used to transmit storage-related operation commands;
the cloud server is provided with the enclave, the storage pool, and a complete index module, wherein the complete index module is in communication connection with the enclave and the output of the enclave is connected to the input of the storage pool; the enclave is configured to deduplicate data, guarantee the confidentiality of plaintext data blocks during deduplication, compress non-duplicate plaintext data blocks, and encrypt the compressed plaintext data blocks; the storage pool is used by the enclave to store ciphertext data blocks, and the complete index module is used to track the fingerprints of all non-duplicate data blocks;
the enclave comprises a frequency tracking unit, a frequency-based data deduplication unit, a key management unit, a compression unit, and an encryption unit, wherein the output of the frequency tracking unit is connected to the input of the frequency-based data deduplication unit, the output of the key management unit is connected to the input of the frequency-based data deduplication unit and the input of the encryption unit, the output of the frequency-based data deduplication unit is connected to the input of the compression unit, and the output of the compression unit is connected to the input of the encryption unit;
the frequency-based data deduplication unit divides deduplication into two stages according to data block frequency and removes all duplicate plaintext data blocks; the frequency-based data deduplication unit comprises a first-stage deduplication unit and a second-stage deduplication unit, wherein the first-stage deduplication unit maintains a small fingerprint index inside the enclave and deduplicates the k most frequently occurring data blocks; and the second-stage deduplication unit performs second-stage deduplication on the duplicate data blocks not removed in the first stage, including infrequent data blocks and newly frequent data blocks.
2. The secure deduplication-before-encryption storage system based on trusted execution protection of claim 1, wherein the key management unit comprises a data key, a query key, and session keys; the data key is used to encrypt and decrypt compressed non-duplicate plaintext data blocks for secure storage; the query key is used to protect plaintext data block information when querying the complete index outside the enclave; and each client maintains a data channel with the enclave for secure data communication, protected by a short-term session key that remains valid only for a single communication session.
3. The secure deduplication-before-encryption storage system based on trusted execution protection of claim 2, wherein the frequency tracking unit is configured to track the frequency of plaintext data blocks within the enclave so as to identify high-frequency and non-high-frequency data blocks, thereby enabling frequency-based data deduplication.
4. The secure deduplication-before-encryption storage system based on trusted execution protection of claim 3, wherein, in the first-stage deduplication, the enclave takes the fingerprint of a plaintext data block as input and obtains the current estimated frequency of the plaintext data block from CM-Sketch; the enclave checks the frequency of the root node of a min-heap, and if the current estimated frequency of the plaintext data block is smaller than the frequency of the min-heap root node, the plaintext data block is an infrequent data block, the enclave skips further querying the hash table, and second-stage deduplication is performed directly; if the current estimated frequency of the plaintext data block is larger than the frequency of the min-heap root node, the plaintext data block is a frequent data block, and the enclave further queries the hash table using the fingerprint of the plaintext data block.
5. The secure deduplication-before-encryption storage system based on trusted execution protection of claim 4, wherein, in the second-stage deduplication, the enclave encrypts the fingerprint of the plaintext data block using the query key and queries the complete index outside the enclave through an OCall based on the encrypted plaintext data block fingerprint; if the encrypted fingerprint is found in the complete index, the OCall returns encrypted data block information, which is decrypted inside the enclave using the query key, and the enclave updates the address and compressed size of the data block into the file allocation table; if the encrypted fingerprint is new to the complete index, the enclave treats the data block as a non-duplicate data block, assigns an address to it, compresses it, and records its compressed size.
6. The secure deduplication-before-encryption storage system based on trusted execution protection of claim 5, wherein the enclave compresses the deduplicated non-duplicate plaintext data blocks, encrypts the compressed non-duplicate plaintext data blocks into ciphertext data blocks, and writes the ciphertext data blocks into a container buffer within the enclave; when the buffer is full, the enclave marks the buffer contents as immutable and releases them to the cloud server for persistent storage.
7. The secure deduplication-before-encryption storage system based on trusted execution protection of claim 6, wherein the enclave creates a file allocation table for each newly uploaded file, and each entry in the file allocation table records the address of a data block and the compressed size of the data block; when the enclave updates the file allocation table, it does not need to compress duplicate data blocks again to obtain their compressed sizes, since the compressed data block sizes are stored in the top-k index and the complete index.
CN202210169874.4A 2022-02-23 2022-02-23 Safe re-deleting storage system based on trusted execution protection and comprising re-deleting and encryption Active CN114518850B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210169874.4A CN114518850B (en) 2022-02-23 2022-02-23 Safe re-deleting storage system based on trusted execution protection and comprising re-deleting and encryption

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210169874.4A CN114518850B (en) 2022-02-23 2022-02-23 Safe re-deleting storage system based on trusted execution protection and comprising re-deleting and encryption

Publications (2)

Publication Number Publication Date
CN114518850A CN114518850A (en) 2022-05-20
CN114518850B true CN114518850B (en) 2024-03-12

Family

ID=81599005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210169874.4A Active CN114518850B (en) 2022-02-23 2022-02-23 Safe re-deleting storage system based on trusted execution protection and comprising re-deleting and encryption

Country Status (1)

Country Link
CN (1) CN114518850B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115484031B (en) * 2022-09-13 2024-03-08 山东大学 SGX-based trusted-free third-party cloud storage ciphertext deduplication method and system
CN116756137A (en) * 2023-08-17 2023-09-15 深圳市木浪云科技有限公司 Method, system and equipment for deleting large-scale data object storage
CN116820352B (en) * 2023-08-23 2023-11-10 湖南奔普智能科技有限公司 Self-service settlement system of ward with data disaster recovery function

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103763362A (en) * 2014-01-13 2014-04-30 西安电子科技大学 Safe distributed duplicated data deletion method
CN103944988A (en) * 2014-04-22 2014-07-23 南京邮电大学 Repeating data deleting system and method applicable to cloud storage
WO2015108931A1 (en) * 2014-01-15 2015-07-23 Cheriton David R Deduplication-based data security
WO2018208786A1 (en) * 2017-05-08 2018-11-15 ZeroDB, Inc. Method and system for secure delegated access to encrypted data in big data computing clusters
CN110955701A (en) * 2019-11-26 2020-04-03 中思博安科技(北京)有限公司 Distributed data query method and device and distributed system
CN113449065A (en) * 2021-06-29 2021-09-28 苏州链约科技有限公司 Data deduplication-oriented decentralized storage method and storage device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9348538B2 (en) * 2012-10-18 2016-05-24 Netapp, Inc. Selective deduplication
US10095635B2 (en) * 2016-03-29 2018-10-09 Seagate Technology Llc Securing information relating to data compression and encryption in a storage device
US10977381B2 (en) * 2018-06-28 2021-04-13 Mohammad Mannan Protection system and method against unauthorized data alteration
US11018859B2 (en) * 2018-12-30 2021-05-25 EMC IP Holding Company, LLC Deduplication of client encrypted data
US11436345B2 (en) * 2020-02-27 2022-09-06 EMC IP Holding Company LLC Protection of secret client data in a multiple client data deduplication environment
KR102345517B1 (en) * 2020-05-06 2021-12-30 인하대학교 산학협력단 Deduplication adapted casedb for edge computing
US11455404B2 (en) * 2020-05-28 2022-09-27 Red Hat, Inc. Deduplication in a trusted execution environment
CN115033919A (en) * 2020-09-04 2022-09-09 支付宝(杭州)信息技术有限公司 Data acquisition method, device and equipment based on trusted equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103763362A (en) * 2014-01-13 2014-04-30 西安电子科技大学 Safe distributed duplicated data deletion method
WO2015108931A1 (en) * 2014-01-15 2015-07-23 Cheriton David R Deduplication-based data security
CN103944988A (en) * 2014-04-22 2014-07-23 南京邮电大学 Repeating data deleting system and method applicable to cloud storage
WO2018208786A1 (en) * 2017-05-08 2018-11-15 ZeroDB, Inc. Method and system for secure delegated access to encrypted data in big data computing clusters
CN110955701A (en) * 2019-11-26 2020-04-03 中思博安科技(北京)有限公司 Distributed data query method and device and distributed system
CN113449065A (en) * 2021-06-29 2021-09-28 苏州链约科技有限公司 Data deduplication-oriented decentralized storage method and storage device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Metadedup: Deduplicating Metadata in Encrypted Deduplication via Indirection;Jingwei Li;2019 35th Symposium on Mass Storage Systems and Technologies (MSST);269-281 *
Research on secure data deduplication and key technologies; Li Jingwei; Journal of Information Security Research (Issue 3); 245-252 *
A new deduplication method for cloud storage files; Yang Chao; Ji Qian; Xiong Sichun; Liu Maozhen; Ma Jianfeng; Jiang Qi; Bai Lin; Journal on Communications (Issue 03); 29-37 *

Also Published As

Publication number Publication date
CN114518850A (en) 2022-05-20

Similar Documents

Publication Publication Date Title
CN114518850B (en) Safe re-deleting storage system based on trusted execution protection and comprising re-deleting and encryption
Grubbs et al. Pancake: Frequency smoothing for encrypted data stores
Zhou et al. SecDep: A user-aware efficient fine-grained secure deduplication scheme with multi-level key management
Li et al. Balancing storage efficiency and data confidentiality with tunable encrypted deduplication
US20220277099A1 (en) Encrypting data records and processing encrypted records without exposing plaintext
US8769310B2 (en) Encrypting data objects to back-up
Young et al. Distinct sector hashes for target file detection
US11139959B2 (en) Stream ciphers for digital storage encryption
WO2016081942A2 (en) Gateway for cloud-based secure storage
Baracaldo et al. Reconciling end-to-end confidentiality and data reduction in cloud storage
Yang et al. Secure and lightweight deduplicated storage via shielded {deduplication-before-encryption}
Li et al. Information leakage in encrypted deduplication via frequency analysis
Song et al. A cloud secure storage mechanism based on data dispersion and encryption
Li et al. Information leakage in encrypted deduplication via frequency analysis: Attacks and defenses
Wang et al. A policy-based deduplication mechanism for securing cloud storage
US11475132B2 (en) Systems and methods for protecting against malware attacks
Li et al. Metadedup: Deduplicating metadata in encrypted deduplication via indirection
Oprea et al. Integrity Checking in Cryptographic File Systems with Constant Trusted Storage.
Sun et al. Tapping the potential: Secure chunk-based deduplication of encrypted data for cloud backup
Miranda et al. S2Dedup: SGX-enabled secure deduplication
Li et al. Enabling secure and space-efficient metadata management in encrypted deduplication
US10678754B1 (en) Per-tenant deduplication for shared storage
Yang et al. Tunable encrypted deduplication with attack-resilient key management
Li et al. Revisiting frequency analysis against encrypted deduplication via statistical distribution
Tian et al. Sed‐Dedup: An efficient secure deduplication system with data modifications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant