CN108400970B

CN108400970B - Similar data message locking, encrypting and de-duplicating method in cloud environment and cloud storage system

Info

Publication number: CN108400970B
Application number: CN201810055819.6A
Authority: CN
Inventors: 姜涛; 袁浩然; 陈晓峰
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2018-01-20
Filing date: 2018-01-20
Publication date: 2020-10-02
Anticipated expiration: 2038-01-20
Also published as: CN108400970A

Abstract

The invention belongs to the technical field of cloud storage and discloses a method for message-locked encryption and deduplication of similar data in a cloud environment. A similarity-preserving hash function, a key extraction method based on error-correcting codes, and a secure symmetric encryption algorithm based on a pseudo-random generator constitute the technical route of the invention. Compared with existing data deduplication methods, the scheme further improves deduplication efficiency, further improves the utilization of cloud server storage space, and further reduces the computing and storage overhead of users and cloud servers. The invention achieves secure and efficient deduplication of similar ciphertext data. It also adopts Hamming distance reduction and tag cutting optimization, which improve the efficiency of tag queries on the cloud server. Experimental results show that the invention is efficient in terms of storage and communication costs.

Description

Similar data message locking, encryption and deduplication method in cloud environment, cloud storage system

Technical field

The invention belongs to the technical field of cloud storage, and in particular relates to a method for message-locked encryption and deduplication of similar data in a cloud environment, and to a cloud storage system.

Background

Huge amounts of data are generated and processed every day. Research by the International Data Corporation on the digital realm indicates that data on the Internet will reach 40,000 EB by 2020 and will continue to double every two years. Cloud computing has brought a paradigm shift in data storage. By providing reliable, scalable, on-demand cloud storage services at relatively low prices, it has made personal and enterprise data management far more convenient. The Cisco Global Cloud Index states that 83% of all data center traffic comes from the cloud and that, by 2019, 80% of data center workloads will be processed in the cloud. Current research shows that 20% to 30% of the data in primary storage is redundant. Specifically, applying whole-file deduplication in backup storage saves more than 50% of the storage space of a standard file system and more than 72% of that of a backup file system. Data deduplication can therefore effectively relieve storage pressure, reduce network traffic by removing redundant data, and improve quality of service.

A number of online and offline storage systems already provide deduplication, and commercial data integration tools such as IBM's InfoSphere QualityStage and FirstLogic's SAP Data Services support duplicate detection. Many clustering-based, classification, link analysis, and statistical techniques are used to detect duplicate records. Much software is also designed to detect and eliminate identical or similar duplicates, such as Duplicate Cleaner, VisiPics, and DupeGuru. However, because users lose physical control of their data in a cloud storage system, the security of that data becomes their primary concern. To protect sensitive data, users therefore usually encrypt it before outsourcing. The goal of encryption, however, is to provide semantic security for the plaintext, making ciphertext indistinguishable from random data. In a multi-user cloud storage system, achieving deduplication while protecting data security has thus become a key and highly challenging problem.

Convergent encryption was proposed to solve this problem. It uses the hash of a file as the convergent key, so identical data always yields the same key; encrypting and decrypting with the convergent key makes ciphertext deduplication possible. This approach was later formalized as message-locked encryption (MLE): because the key is derived from the data itself, the cloud server can determine whether two ciphertexts come from the same plaintext. A series of subsequent message-locked encryption schemes sought to improve security or provide other new features. However, all of these schemes consider only deduplication of identical data and cannot be applied to similar data.

Many practical systems need to deduplicate or search for similar data items, such as items containing errors, misspellings, or inconsistent content, during data detection, data cleaning, and data aggregation. Some similar-data retrieval schemes and deduplication systems have been proposed, and many schemes and software packages remove similar web pages, text documents, music, pictures, videos, or binary data on local disks. However, they mainly address plaintext rather than ciphertext, and they are suited to personal use rather than multi-user cloud scenarios. Existing schemes are therefore difficult to apply directly to secure similar-data deduplication in the cloud. Although some schemes support privacy-preserving deduplication of similar images in the cloud, they assume that a user group exists and that the encryption key is shared within it, whereas in a cloud environment a user can hardly know which other users hold the same data. In general, the challenge of similar-data deduplication in the cloud is that cloud users can hardly negotiate a common encryption key with one another, and the cloud server can hardly determine whether two ciphertexts were produced from similar data.
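The convergent-key idea described above can be illustrated with a minimal sketch. SHA-256 is assumed here as the hash function, and the names `convergent_key` and `bit_distance` are hypothetical; the text does not prescribe either. The sketch shows that identical data yields identical keys, while a near-copy differing in a single bit yields an unrelated key.

```python
import hashlib


def convergent_key(data: bytes) -> bytes:
    """Derive a convergent key as the SHA-256 digest of the plaintext (assumed hash)."""
    return hashlib.sha256(data).digest()


def bit_distance(a: bytes, b: bytes) -> int:
    """Count differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


original = b"quarterly report, revision 7"
identical = b"quarterly report, revision 7"
near_copy = b"quarterly report, revision 6"  # '7' (0x37) vs '6' (0x36): one bit differs

# Identical data always yields the same key, so exact deduplication works.
print(convergent_key(original) == convergent_key(identical))

# The plaintexts differ in a single bit ...
print(bit_distance(original, near_copy))

# ... but the two keys are unrelated, so similarity is lost.
print(bit_distance(convergent_key(original), convergent_key(near_copy)))
```

This is why a key derived from a cryptographic hash supports deduplication of identical data only.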

To sum up, the problems in the prior art are:

(1) In existing message-locked encryption schemes the key is obtained by hashing the plaintext, and a hash function produces a completely different value even if the plaintext differs by a single bit. Ciphertexts produced by traditional message-locked encryption therefore no longer preserve similarity, and the cloud server cannot tell whether the plaintexts of two ciphertexts are similar. Existing message-locked encryption schemes are thus difficult to apply directly to secure similar-data deduplication in the cloud.

(2) Although some schemes let group users negotiate a key and share it within the group, users in a cloud environment can upload data anytime and anywhere, and the cloud server cannot know all data owners before the data is uploaded. Group key negotiation therefore cannot be used for similar-data deduplication either.

The difficulty and significance of solving the above technical problems:

(1) The problem that a similar-data message-locked encryption and deduplication method must solve is how to break through the limitation of existing message-locked encryption schemes so that similar data remains similar after encryption.

(2) A similar-data message-locked encryption scheme can be used to build a similar-data encrypted deduplication system in which the cloud server deduplicates the ciphertexts of similar data. This further improves the efficiency of ciphertext deduplication and saves substantial storage and management resources on the cloud server.

Summary of the invention

In view of the problems in the prior art, the present invention provides a method for message-locked encryption and deduplication of similar data in a cloud environment, and a cloud storage system.

The invention is implemented as follows. The method uses a similarity-preserving hash function (such as SimHash or PHash) so that similar data obtains similar tags, a key extraction method based on error-correcting codes so that similar plaintexts always yield the same encryption key, and a secure symmetric encryption algorithm based on a pseudo-random generator to perform the message-locked encryption and deduplication of similar data. A user who wishes to upload data first uses the similarity-preserving hash algorithm to generate a deduplication tag for the plaintext and sends it to the cloud server, which determines whether similar data is already stored. If the cloud server holds no similar data, the user generates a similar-data key and the auxiliary information used for similar-key recovery, and sends the ciphertext and the auxiliary information to the cloud server. If the cloud server holds similar data, it returns the auxiliary information for recovering the similar key; the user encrypts the data with the recovered key and uses the resulting ciphertext to run a similar-data ownership verification with the server. If the verification passes, the cloud server allows the user to access the data. In addition, the invention improves tag query efficiency through Hamming distance reduction and tag cutting optimization.

Further, the method comprises the following steps:

The client first uses the similarity-preserving hash algorithm to generate a deduplication tag for the plaintext and sends it to the cloud server, which determines whether similar data is already stored on the cloud server;

If the cloud server holds no similar data, the user generates a similar-data key and the auxiliary information used for similar-key recovery, and sends the ciphertext and the auxiliary information to the cloud server;

If the cloud server holds similar data, it returns the auxiliary information for recovering the similar key to the user; the user encrypts the data with the recovered similar key and uses the resulting ciphertext to run a similar-data ownership verification with the server; if the verification passes, the cloud server allows the user to access the data.

Further, the method uses an [n,k,2t+1]_F error-correcting code C. The idea of the Hamming-distance-based secure sketch is to use C to correct errors in the data w. On input w, a codeword c∈C is chosen uniformly at random, and s=SS(w)=w-c is the shift that takes c to w. Rec(w',s) computes c'=w'-s, decodes c' to obtain c, and then obtains w through w=c+s.

Further, the client applies the similarity-preserving hash algorithm to generate the deduplication tag and the similar-data key of the plaintext. With a similarity-preserving hash, similar plaintexts map to similar tags and similar-data keys of a specific length. Similar data within a specific Hamming distance always obtains the same random encryption key. The first user chooses some auxiliary parameters and computes the random key k_w' for the plaintext w'; the auxiliary parameters are stored on the cloud server. When a subsequent user holds the tag t_w of similar plaintext data w (w≈w') and wants to perform similar-data deduplication, the cloud server sends the auxiliary parameters to that user, who generates the key k_w by running the key regeneration algorithm. If the Hamming distance between file w and file w' is less than a specific value, dis(f_w,f_w')<t, the key regeneration algorithm outputs the same random key k_w'=k_w.

Further, the similar message-locked encryption scheme of the method consists of six polynomial-time algorithms (FKG, RKG, REP, ENC, DEC, TAG):

FKG(1^λ,r₂,w)→fk_w: a similar-key generation algorithm based on a similarity-preserving hash function, used by the user to compute digest information for the data. It takes the security parameter λ, a random number r₂∈{0,1}^λ, and the file w as input, and outputs the similarity digest fk_w of the file;

RKG(1^λ,r₃,fk_w)→{k_w,P_w}: a key generation algorithm, used by the user to compute the encryption key and the auxiliary parameters of the data. x is a public parameter. RKG uses the sketch algorithm SS{r₃,w}→P_w of the secure sketch and the extraction algorithm Ext(w,x)→{K_w} of the fuzzy extractor to generate the auxiliary parameters P={x,s} and a random encryption key K_w. Here r₃ is a random parameter used to generate a random codeword C(r₃)→c, where C(·) is a codeword generation algorithm; the codeword c is used by the SS algorithm of the secure sketch;

REP(fk_w',P_w)→k_w: a key regeneration algorithm run by the user. It takes the auxiliary parameters P_w and the fuzzy digest fk_w' of a file as input, and outputs the private key k_w if and only if fk_w' is similar to fk_w; otherwise it outputs a random value;

ENC(k_w,w)→c_w: an encryption algorithm run by the user to encrypt data and obtain the corresponding ciphertext. It takes the file w and a private key k_w as input and returns the ciphertext c_w=w⊕G(k_w), where G(k_w)→{0,1}^|w| is a pseudo-random generator that takes k_w as input and outputs a pseudo-random encryption key G(k_w) of length |w|;

DEC(k_w,c_w)→w: a decryption algorithm run by the user to recover the plaintext of the input data. It takes the ciphertext c_w and a private key k_w as input and returns the plaintext w=c_w⊕G(k_w);

TAG(1^λ,r₁,w)→t_w: a tag generation algorithm, implemented with a similarity-preserving hash function and run by the user to compute a digest of the input data. It takes the security parameter λ, a random number r₁, and the data w as input, and returns the data tag t_w.

Another object of the present invention is to provide a cloud storage system applying the method for message-locked encryption and deduplication of similar data in a cloud environment.

To sum up, the advantages and positive effects of the invention are as follows. It provides a secure and efficient scheme for deduplicating similar data, called fuzzy message-locked encryption (FuzzyMLE). A similarity-preserving hash function, a key extraction method based on error-correcting codes, and a secure symmetric encryption algorithm based on a pseudo-random generator constitute its technical route. In addition, tag query efficiency is improved through Hamming distance reduction and tag cutting optimization. Finally, the efficiency of the invention is analyzed, and its overhead is evaluated on a public database by building a working system. The experimental results show that the invention is efficient in terms of storage and communication overhead.

The invention targets secure and efficient cross-user deduplication of similar data. If the cloud server already stores user A's data and user B's data is similar to it, the cloud server can deduplicate the ciphertext without communicating with user A. A similar message-locked encryption scheme is formally defined and a corresponding system is built. By improving and combining several techniques, the invention overcomes the challenge of secure and efficient near-duplicate deduplication in cloud storage systems. First, in place of the cryptographic hash tags used in traditional message-locked encryption, a similarity-preserving hash function processes similar data and generates a similarity tag for each data item. Second, exact tag matching is replaced by an improved Hamming-distance tag query that provides efficient similar-data lookup. At the same time, a similar-key generation method based on error-correcting codes generates the similar-data encryption key from the user's data whenever the data is similar. Moreover, a secure XOR encryption scheme based on a pseudo-random generator replaces a conventional symmetric encryption algorithm (e.g., AES) for the encryption operation. Finally, Hamming distance reduction and tag cutting optimization further improve tag query efficiency.
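The tag cutting optimization is not detailed at this point in the text. One common way to speed up Hamming-distance lookups, shown here only as an assumed illustration and not as the invention's own procedure, is to cut each 64-bit tag into t+1 blocks: two tags within Hamming distance t must agree exactly on at least one block, so the server only needs to compare a query against tags that share a block. The class `TagIndex` and its methods are hypothetical names.

```python
from collections import defaultdict

TAG_BITS = 64


def hamming(a: int, b: int) -> int:
    """Hamming distance between two integer-encoded tags."""
    return bin(a ^ b).count("1")


def cut(tag: int, pieces: int) -> list[int]:
    """Cut a 64-bit tag into `pieces` equal-width blocks (leftover high bits are ignored)."""
    width = TAG_BITS // pieces
    mask = (1 << width) - 1
    return [(tag >> (i * width)) & mask for i in range(pieces)]


class TagIndex:
    """Index tags by block so that a query only inspects tags sharing a block."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.pieces = threshold + 1        # pigeonhole: t errors cannot touch t+1 blocks
        self.buckets = defaultdict(set)    # (block position, block value) -> tags

    def add(self, tag: int) -> None:
        for pos, block in enumerate(cut(tag, self.pieces)):
            self.buckets[(pos, block)].add(tag)

    def query(self, tag: int) -> set[int]:
        candidates = set()
        for pos, block in enumerate(cut(tag, self.pieces)):
            candidates |= self.buckets.get((pos, block), set())
        # Candidates only share one block; confirm with the full distance.
        return {c for c in candidates if hamming(c, tag) <= self.threshold}


index = TagIndex(threshold=3)
stored = 0x0123456789ABCDEF
index.add(stored)

near = stored ^ ((1 << 0) | (1 << 20) | (1 << 40))  # three flipped bits
far = stored ^ 0xFFFF                               # sixteen flipped bits
print(index.query(near) == {stored})
print(index.query(far) == set())
```

The final distance check keeps the result exact; the block buckets only narrow the set of tags that have to be compared.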

A cloud storage system consists of a remote cloud storage server (S) and a set of clients (Cs) that wish to store sensitive data on S. To protect their data, the clients want to encrypt it before uploading. To reduce storage overhead and unnecessary communication between S and Cs, S and Cs perform secure deduplication of the uploaded ciphertexts. Unlike existing secure deduplication methods, which handle only identical data, the invention considers a more challenging case: secure deduplication of similar data. To improve communication efficiency, if clients have already stored similar data in S's database, only the first user needs to upload the ciphertext to S. As in most existing secure exact-deduplication methods, users do not need to communicate with one another directly. Each communicates with S, which processes or forwards messages as needed.

Description of the drawings

FIG. 1 is a flowchart of the method for message-locked encryption and deduplication of similar data in a cloud environment provided by an embodiment of the present invention.

FIG. 2 is a schematic diagram of the secure sketch provided by an embodiment of the present invention.

FIG. 3 is a schematic diagram of the fuzzy extractor provided by an embodiment of the present invention.

FIG. 4 is a schematic diagram of similar-data message-locked encryption provided by an embodiment of the present invention.

FIG. 5 is a schematic diagram of the SimHash computation time provided by an embodiment of the present invention.

FIG. 6 is a schematic diagram of the PHash computation time provided by an embodiment of the present invention.

FIG. 7 is a schematic diagram of the time spent deduplicating text data provided by an embodiment of the present invention.

FIG. 8 is a schematic diagram of the time spent deduplicating image data provided by an embodiment of the present invention.

FIG. 9 is a schematic diagram of the test hardware environment provided by an embodiment of the present invention.

Detailed description

To make the objectives, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments. The specific embodiments described here only explain the invention and do not limit it.

With the explosive growth of data, storage efficiency has become the most important goal for cloud storage systems. Most cloud storage providers use deduplication to reduce storage and management costs. In recent years many secure deduplication methods have been proposed to further protect the privacy of user data. Meanwhile, many practical applications show that eliminating similar (or erroneous) data can further reduce providers' storage overhead and improve the quality of stored data. However, secure and efficient similar-data deduplication methods for cloud storage are still lacking.

As shown in FIG. 1, the method for message-locked encryption and deduplication of similar data in a cloud environment provided by an embodiment of the present invention includes the following steps:

S101: The client first uses the similarity-preserving hash algorithm to generate a deduplication tag for the plaintext and sends it to the cloud server, which determines whether similar data is already stored on the cloud server.

S102: If the cloud server holds no similar data, the user generates a similar-data key and the auxiliary information used for similar-key recovery, and sends the ciphertext and the auxiliary information to the cloud server.

S103: If the cloud server holds similar data, it returns the auxiliary information for recovering the similar key to the user. The user encrypts the data with the recovered similar key and uses the resulting ciphertext to run a similar-data ownership verification with the server. If the verification passes, the cloud server allows the user to access the data.
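The server side of steps S101 to S103 can be sketched roughly as follows. This is an illustration under stated assumptions, not the invention's implementation: the class `CloudServer`, its method names, the linear scan over stored tags, and both thresholds are hypothetical, and the ownership check (comparing the submitted ciphertext with the stored one bit by bit) is only a placeholder because the text does not specify the verification protocol.

```python
from dataclasses import dataclass
from typing import Optional


def hamming(a: int, b: int) -> int:
    """Hamming distance between two integer-encoded tags."""
    return bin(a ^ b).count("1")


@dataclass
class Record:
    tag: int           # similarity-preserving tag t_w
    helper: bytes      # auxiliary information P_w for key recovery
    ciphertext: bytes  # ciphertext uploaded by the first owner


class CloudServer:
    def __init__(self, tag_threshold: int, proof_threshold: int):
        self.tag_threshold = tag_threshold      # assumed: max tag distance counted as similar
        self.proof_threshold = proof_threshold  # assumed: max ciphertext bit difference
        self.records: list[Record] = []

    def find_similar(self, tag: int) -> Optional[Record]:
        for rec in self.records:
            if hamming(rec.tag, tag) <= self.tag_threshold:
                return rec
        return None

    def check_tag(self, tag: int):
        """S101: decide whether the client must upload or can recover a key."""
        rec = self.find_similar(tag)
        return ("upload", None) if rec is None else ("recover", rec.helper)

    def store(self, tag: int, helper: bytes, ciphertext: bytes) -> None:
        """S102: first owner uploads ciphertext and auxiliary information."""
        self.records.append(Record(tag, helper, ciphertext))

    def verify_ownership(self, tag: int, ciphertext: bytes) -> bool:
        """S103: placeholder check that the submitted ciphertext is close to the stored one."""
        rec = self.find_similar(tag)
        if rec is None or len(ciphertext) != len(rec.ciphertext):
            return False
        diff = sum(bin(x ^ y).count("1") for x, y in zip(ciphertext, rec.ciphertext))
        return diff <= self.proof_threshold


server = CloudServer(tag_threshold=3, proof_threshold=8)
first = server.check_tag(0b10101100)                 # nothing stored yet
server.store(0b10101100, b"P_w", b"\x10\x20\x30")
second = server.check_tag(0b10101101)                # tag differs in one bit
granted = server.verify_ownership(0b10101101, b"\x10\x20\x31")
print(first, second, granted)
```

The first client is told to upload, while the second client with a nearby tag receives the stored auxiliary information and is granted access after the check.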

The application principle of the invention is further described below with reference to the drawings.

1. Secure sketch

A secure sketch reconstructs the original data exactly from similar data with the help of auxiliary information. Let M be a metric space with distance function dis; FIG. 2 is a schematic diagram of the secure sketch. It is defined as follows:

A secure sketch with parameters (M,m,m',t) consists of a pair of efficient randomized algorithms, the sketch algorithm and the recovery algorithm (SS,Rec).

Sketch algorithm SS: takes an element w∈M as input and outputs a string s∈{0,1}^*.

Recovery algorithm Rec: takes an element w'∈M and a string s∈{0,1}^* as input. When dis(w,w')≤t, Rec(w',SS(w))=w; when dis(w,w')>t, the output of Rec is not guaranteed.

Hamming-distance-based secure sketch: to obtain a secure sketch from an error-correcting code over Fⁿ under the Hamming distance, the invention uses an [n,k,2t+1]_F error-correcting code C. The idea is to use C to correct errors in the data w. On input w, a codeword c∈C is chosen uniformly at random, and s=SS(w)=w-c is the shift that takes c to w. Rec(w',s) computes c'=w'-s and decodes c' to obtain c: because dis(w,w')≤t, dis(c,c')≤t as well. Finally, w is obtained through w=c+s.
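The code-offset construction above can be sketched over F_2, where both subtraction and addition are XOR. The sketch assumes a simple repetition code (each message bit repeated five times, minimum distance 5, so t = 2) purely for illustration; the text does not name a specific code, and the function names are hypothetical.

```python
import secrets

REP_LEN = 5  # each message bit is repeated 5 times: a [5k, k, 5] code, so t = 2


def encode(msg_bits):
    """Map k message bits to a codeword of 5k bits."""
    return [b for b in msg_bits for _ in range(REP_LEN)]


def decode(word):
    """Majority-vote each block and return the nearest codeword."""
    msg = [int(sum(word[i:i + REP_LEN]) > REP_LEN // 2)
           for i in range(0, len(word), REP_LEN)]
    return encode(msg)


def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]


def sketch(w):
    """SS(w): choose a random codeword c and output s = w - c."""
    c = encode([secrets.randbelow(2) for _ in range(len(w) // REP_LEN)])
    return xor(w, c)


def recover(w_prime, s):
    """Rec(w', s): compute c' = w' - s, decode c' to c, return w = c + s."""
    c = decode(xor(w_prime, s))
    return xor(c, s)


w = [1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
s = sketch(w)

w_prime = list(w)
w_prime[3] ^= 1   # two bit errors, within t = 2
w_prime[12] ^= 1

print(recover(w_prime, s) == w)
```

Because c' differs from c in exactly the positions where w' differs from w, any input within distance t decodes back to the original codeword and therefore to w.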

2. Fuzzy extractor

A fuzzy extractor lets two similar inputs yield the same string K. A fuzzy extractor with parameters (M,m,l,t,ε) consists of a pair of efficient algorithms, the generation algorithm and the regeneration algorithm (KG,REP).

Generation algorithm KG(w)→{K,P}: takes w∈M as input and outputs an extracted string K∈{0,1}^l and a public auxiliary string P∈{0,1}^*.

Regeneration algorithm REP(w',P)→{K}: takes w'∈M and a string P∈{0,1}^* as input. If dis(w,w')≤t and KG(w)→{K,P}, then REP(w',P)=K. The fuzzy extractor is secure if, whenever the input over M has min-entropy at least m, (R,P,E)≈_ε(U,P,E).
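A fuzzy extractor of this shape can be sketched by combining a code-offset secure sketch with an extractor. In the sketch below, HMAC-SHA256 keyed with a public seed x stands in for Ext(w,x), and a repetition code stands in for the error-correcting code; both are assumptions chosen for brevity, not choices stated in the text, and the function names are hypothetical.

```python
import hashlib
import hmac
import secrets

REP_LEN = 5  # repetition length; majority decoding corrects up to 2 flipped bits


def _encode(bits):
    return [b for b in bits for _ in range(REP_LEN)]


def _decode(word):
    msg = [int(sum(word[i:i + REP_LEN]) > REP_LEN // 2)
           for i in range(0, len(word), REP_LEN)]
    return _encode(msg)


def _xor(a, b):
    return [x ^ y for x, y in zip(a, b)]


def ext(w, x: bytes) -> bytes:
    """Stand-in for Ext(w, x): HMAC-SHA256 keyed with the public seed x."""
    return hmac.new(x, bytes(w), hashlib.sha256).digest()


def kg(w):
    """KG(w) -> (K, P), with public helper P = (x, s)."""
    c = _encode([secrets.randbelow(2) for _ in range(len(w) // REP_LEN)])
    s = _xor(w, c)               # secure sketch of w
    x = secrets.token_bytes(16)  # public extractor seed
    return ext(w, x), (x, s)


def rep(w_prime, P):
    """REP(w', P) -> K: recover w from w' and s, then re-extract."""
    x, s = P
    c = _decode(_xor(w_prime, s))
    return ext(_xor(c, s), x)


w = [1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
K, P = kg(w)

near = list(w)
near[3] ^= 1
near[12] ^= 1            # two errors: within the correction radius

far = list(w)
for i in (0, 1, 2):
    far[i] ^= 1          # three errors in one block: decoding picks another codeword

print(rep(near, P) == K)
print(rep(far, P) == K)
```

An input within distance t regenerates the same key from the public helper alone, while an input that decodes to a different codeword yields an unrelated key.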

Traditional hash functions (such as SHA-1 or SHA-256) and symmetric encryption algorithms (such as AES-128 or AES-256) cannot be applied directly to secure deduplication of similar data. The invention therefore integrates similarity-preserving hash algorithms (SimHash and PHash), a fuzzy key extraction algorithm based on error-correcting codes, and an XOR encryption scheme based on a one-time pad to achieve secure and efficient similar-data deduplication in the system. The method encrypts and decrypts client data and allows the cloud server to perform secure similar-data deduplication on users' ciphertexts.

The similar-data deduplication method is designed so that the client encrypts the data and the cloud server detects near-duplicates among the ciphertexts. The client first applies the similarity-preserving hash algorithm to generate the deduplication tag and the similar-data key of the plaintext. With a similarity-preserving hash, similar plaintexts map to similar tags and similar-data keys of a specific length (for example, 64 bits). These fixed-length tags also significantly reduce storage overhead. The invention designs a random key generation algorithm based on a fuzzy extractor, so that similar data within a specific Hamming distance always obtains the same random encryption key. At this stage, the first user chooses some auxiliary parameters and computes a random key (for example, k_w') for the plaintext w'. The auxiliary parameters are then stored on the cloud server. When a subsequent user holds the tag t_w of similar plaintext data w (w≈w') and wants to perform similar-data deduplication, the cloud server sends the auxiliary parameters to that user, who generates the key k_w by running the key regeneration algorithm. If the Hamming distance between file w and file w' is less than a specific value, for example dis(f_w,f_w')<t, the key regeneration algorithm outputs the same random key k_w'=k_w.
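A 64-bit similarity-preserving tag of the kind described above can be illustrated with a minimal SimHash sketch. The tokenization (lower-cased whitespace-separated words), the per-token hash (BLAKE2b truncated to 8 bytes), and the unweighted tally are assumptions for illustration; the text names SimHash but does not fix these details, and the random input r₁ of TAG is omitted here.

```python
import hashlib

TAG_BITS = 64


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint over whitespace-separated tokens."""
    tally = [0] * TAG_BITS
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(TAG_BITS):
            # Each token votes +1 or -1 on every bit position.
            tally[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is 1 where the votes are positive.
    return sum(1 << i for i in range(TAG_BITS) if tally[i] > 0)


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


doc_a = "the cloud server stores the encrypted backup of the quarterly report"
doc_b = "the cloud server stores the encrypted backup of the annual report"
doc_c = "perceptual hashing maps visually similar images to nearby fingerprints"

print(hamming(simhash(doc_a), simhash(doc_b)))  # near-duplicates share most votes
print(hamming(simhash(doc_a), simhash(doc_c)))  # unrelated text shares few
```

Because most tokens of two near-duplicate documents cast the same votes, their fingerprints tend to differ in only a few bit positions, which is what lets the server compare tags by Hamming distance.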

Because the cloud server must perform similarity detection on users' sensitive data, similar data must encrypt to similar ciphertexts. This contradicts the conventional encryption used in message-locked encryption. To solve this problem, the invention uses a simple XOR encryption algorithm based on a one-time-pad generator. As in a stream cipher, a pseudorandom generator G(·) expands the similar key into a pad of sufficient bit length. If two plaintexts w and w' are similar, their respective similar keys k_w and k_w' satisfy k_w' = k_w; otherwise k_w' ≠ k_w. Intuitively, for similar plaintexts of equal length the two pads are identical and cancel, so the ciphertexts c_w = w ⊕ G(k_w) and c_w' = w' ⊕ G(k_w') satisfy c_w ⊕ c_w' = w ⊕ w', and the Hamming distance between the ciphertexts equals that between the plaintexts.

The similar-message-locked encryption scheme consists of six polynomial-time algorithms (FKG, RKG, REP, ENC, DEC, TAG):

FKG(1^λ, r₂, w) → fk_w: a similar-key generation algorithm based on a similarity-preserving hash function, with which the user computes digest information of the data. It takes as input the security parameter λ, a random number r₂ ∈ {0,1}^λ and a file w, and outputs the similar digest fk_w of the file. In practice it is implemented with SimHash or PHash.
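As a rough illustration of how a similarity-preserving digest such as fk_w can be computed, the following is a minimal SimHash sketch for text. It is not the patented implementation: the whitespace tokenisation, the use of truncated SHA-256 as the per-token hash, and the way the `salt` argument stands in for the random number r₂ (or r₁ for TAG) are all assumptions made for the example.

```python
import hashlib


def simhash(text: str, salt: bytes = b"", bits: int = 64) -> int:
    """Toy SimHash over whitespace-separated tokens (illustrative only)."""
    counts = [0] * bits
    for token in text.split():
        # Assumption: truncated SHA-256 of salt || token is the per-token hash.
        digest = hashlib.sha256(salt + token.encode()).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is 1 when the vote for that position is positive.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


fk_w = simhash("an example review text", salt=b"r2")  # hypothetical salt value
```

Because the fingerprint is built from per-token votes, changing a few tokens tends to flip only a few bits, which is the property the key extraction step relies on.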

RKG(1^λ, r₃, fk_w) → {k_w, P_w}: a key generation algorithm with which the user computes the encryption key and the auxiliary parameters of the data. x is a public parameter. RKG uses the sketch algorithm of a secure sketch, SS{r₃, w} → P_w, and the extraction algorithm of a fuzzy extractor, Ext(w, x) → {K_w}, to generate the auxiliary parameters P = {x, s} and a random encryption key K_w. Here r₃ is a random parameter used to generate a random codeword C(r₃) → c (C(·) is a code generation algorithm), and the codeword c is used by the SS algorithm of the secure sketch.

REP(fk_w', P_w) → k_w: a key regeneration algorithm run by the user. Like the reproduction algorithm of a fuzzy extractor, it takes the auxiliary parameters P_w and the fuzzy digest fk_w' of the file as input and outputs the private key k_w if and only if fk_w' is similar to fk_w; otherwise it outputs a random value.
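The interplay of RKG and REP can be sketched with a code-offset secure sketch. The example below is only a toy under stated assumptions: a bitwise repetition code stands in for the [n, k, 2t+1]_F error-correcting code, SHA-256 stands in for the extractor Ext, and the names `rkg`, `rep_key` and `REPEAT` are hypothetical.

```python
import hashlib
import secrets

REPEAT = 5  # assumption: repetition code, each block tolerates 2 flipped bits


def encode(msg_bits):
    """C(r3) -> c: repeat every message bit REPEAT times."""
    return [b for b in msg_bits for _ in range(REPEAT)]


def decode(code_bits):
    """Majority vote inside each block of REPEAT bits."""
    blocks = [code_bits[i:i + REPEAT] for i in range(0, len(code_bits), REPEAT)]
    return [1 if sum(block) > REPEAT // 2 else 0 for block in blocks]


def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]


def ext(bits, x: bytes) -> bytes:
    """Assumption: SHA-256 over x || bits plays the role of Ext(w, x)."""
    return hashlib.sha256(x + bytes(bits)).digest()


def rkg(fk_bits):
    """Return (k_w, P_w) with P_w = (x, s) and s = fk XOR c."""
    if len(fk_bits) % REPEAT:
        raise ValueError("digest length must be a multiple of REPEAT")
    r3 = [secrets.randbelow(2) for _ in range(len(fk_bits) // REPEAT)]
    s = xor(fk_bits, encode(r3))
    x = secrets.token_bytes(16)
    return ext(fk_bits, x), (x, s)


def rep_key(fk_prime_bits, helper):
    """Recover the key from a noisy digest fk' and the helper data P_w."""
    x, s = helper
    c = encode(decode(xor(fk_prime_bits, s)))  # error-correct fk' XOR s back to c
    return ext(xor(c, s), x)                   # c XOR s equals fk when fk' is close
```

If fk' differs from fk in at most two bits per block, `rep_key` returns the same key as `rkg`; with more errors the decoder lands on a different codeword and the derived key is unrelated.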

ENC(k_w, w) → c_w: an encryption algorithm run by the user to encrypt the data and obtain the corresponding ciphertext. It takes as input the file w and a private key k_w and returns the ciphertext

c_w = w ⊕ G(k_w),

where G(k_w) → {0,1}^|w| is a pseudorandom generator that takes k_w as input and outputs a pseudorandom pad G(k_w) of length |w|.

DEC(k_w, c_w) → w: a decryption algorithm run by the user to recover the plaintext. It takes as input the ciphertext c_w and a private key k_w and returns the plaintext

w = c_w ⊕ G(k_w).
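The XOR-based ENC and DEC algorithms can be sketched in a few lines. This is a minimal sketch, assuming SHAKE-256 as the pseudorandom generator G (the source does not name a concrete generator); the function names are hypothetical.

```python
import hashlib


def G(key: bytes, length: int) -> bytes:
    """Assumption: SHAKE-256 used as an extendable-output pseudorandom generator."""
    return hashlib.shake_256(key).digest(length)


def enc(key: bytes, w: bytes) -> bytes:
    """c_w = w XOR G(k_w), with the pad stretched to |w| bytes."""
    pad = G(key, len(w))
    return bytes(a ^ b for a, b in zip(w, pad))


# XOR is its own inverse, so decryption applies the same pad again.
dec = enc


def hamming(a: bytes, b: bytes) -> int:
    """Bitwise Hamming distance between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


k = b"shared similar key"            # hypothetical key value
c1 = enc(k, b"hello world")
c2 = enc(k, b"hello w0rld")
# Same key, same pad: the ciphertexts differ exactly where the plaintexts do.
```

Under a shared key the pad cancels in c1 ⊕ c2, so the server can measure similarity on ciphertexts; the same property means that anyone holding both ciphertexts learns w ⊕ w'.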

TAG(1^λ, r₁, w) → t_w: a tag generation algorithm implemented with a similarity-preserving hash function, so that similar data produce similar digests. It is run by the user to compute the digest of the input data. It takes as input the security parameter λ, a random number r₁ and the data w, and returns the data tag t_w.

Based on the definition of the similar-data message-locked encryption scheme, FIG. 4 gives a schematic diagram of the scheme. As in message-locked encryption, all algorithms may depend on the public parameters P_w, which are known to all parties, including the adversary.

The application principle of the invention is further described below with reference to specific embodiments.

In the system of the invention, the users C are assumed to be data owners who wish to outsource their data to the cloud server with similar-data deduplication. After uploading, a user only needs to keep the identity link (for example ID_w) and the encryption key (for example k_w) of each data item (for example w). The plaintext w is recovered by downloading the ciphertext c_w from the cloud server and decrypting it. The cloud server S stores all data information received from users and maintains a data set DB = {Tag, ID, Cipher}. The data set DB comprises three required files: the tag file, the identity link file and the ciphertext file.

The similar-data locked encryption scheme consists of three phases: system setup, upload and download. Because upload and download are two-party interactive protocols, the invention formally defines an interactive protocol as Π: [P₁: in₁; P₂: in₂] → [P₁: out₁; P₂: out₂]. Π denotes an interactive protocol run by two parties P₁ and P₂, where in_i and out_i denote the input and output of party P_i. The three phases of the similar-data locked encryption system are constructed as follows:

The system setup phase is run by user C. r₁ and r₂ are two public parameters, and r₃ is a randomly chosen parameter used as the input of the [n, k, 2t+1]_F error-correcting code. Without loss of generality, assume that user A is the first owner of data w' and wishes to upload it to the cloud storage server S. User A first runs the tag generation algorithm TAG(1^λ, r₁, w') → t_w' and the similar-key generation algorithm FKG(1^λ, r₂, w') → fk_w' to generate the tag t_w' and the similar-data digest fk_w' of w'. (In practice both TAG and FKG are implemented with SimHash or PHash, so the user supplies

[formula missing in source]

and

[formula missing in source]

as the inputs of TAG and FKG, respectively.) User A then runs the key generation algorithm RKG(1^λ, r₃, fk_w') → {k_w', P_w'} to obtain the similar encryption key k_w' and the auxiliary parameters P_w'.

The upload phase is an interactive protocol run between user C and the cloud server S. User C first sends the tag t_w' to S, which uses it to detect similar duplicates in its stored database. Two cases can then arise at the cloud server:

No duplicate exists. If no tag t_w stored by the cloud server S is similar to t_w', the user must upload the data. The upload phase runs Upload: [C: t_w', w', r₃; S: DedupTb] → [C: k_w', c_w', P_w', Link_w'; S: t_w', R_w', c_w', Link_w'].

The user first runs the random key generation algorithm RKG(1^λ, r₃, fk_w') → {k_w', P_w'} to generate a random encryption key and auxiliary parameters, then encrypts the data as ENC(k_w', w') → c_w' and sends {t_w', P_w', c_w'} to the cloud server S. S stores {t_w', P_w', c_w'} and returns the link Link_w' to user C for downloading the ciphertext c_w'.

A duplicate exists. If the cloud server already stores data w whose tag t_w is similar to t_w', the upload phase runs Upload: [C: t_w', w'; S: DedupTb, P_w'] → [C: k_w, c_w', Link_w; S: Link_w]. Based on the tag t_w', the cloud server S returns the auxiliary information P_w = {x_w, s_w} to user C. On receiving P_w, the user first runs the key regeneration algorithm REP(fk_w', P_w) → k_w and then encrypts the data w' as ENC(k_w, w') → c_w'. Next, S and C execute a proof-of-ownership protocol for similar data, which verifies whether the user's ciphertext c_w' is similar to the ciphertext c_w stored on the server. If the user passes verification, the cloud server returns the link Link_w, with which the user can download the ciphertext c_w stored on S. Because similar data are already stored on the cloud server, the user does not need to upload c_w'.
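The two upload cases can be summarised as a small control-flow sketch. Everything here is a simplification under assumptions: tags are plain integers compared with a linear scan, "similar" is taken to mean a Hamming distance of at most the threshold, the client-side steps (RKG or REP followed by ENC) and the ownership proof are passed in as hypothetical callbacks, and the `Server`, `Record` and `upload` names are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    tag: int          # t_w
    helper: object    # auxiliary parameters P_w
    cipher: bytes     # c_w
    link: str         # Link_w


@dataclass
class Server:
    threshold: int
    db: list = field(default_factory=list)

    def find_similar(self, tag: int) -> Optional[Record]:
        """Linear scan for a stored tag within the Hamming threshold."""
        for rec in self.db:
            if bin(rec.tag ^ tag).count("1") <= self.threshold:
                return rec
        return None

    def store(self, tag: int, helper: object, cipher: bytes) -> str:
        link = f"link-{len(self.db)}"
        self.db.append(Record(tag, helper, cipher, link))
        return link


def upload(server: Server, tag: int, new_upload, prove_ownership) -> Optional[str]:
    """Return a download link, or None if the ownership proof fails."""
    rec = server.find_similar(tag)
    if rec is None:
        # Case 1: no duplicate. The client runs RKG and ENC, then uploads.
        helper, cipher = new_upload()
        return server.store(tag, helper, cipher)
    # Case 2: duplicate. The client runs REP and ENC, then proves ownership.
    return rec.link if prove_ownership(rec) else None
```

In the duplicate case nothing new is stored: the second user simply receives the link of the existing record.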

The download phase is an interactive protocol initiated by user C to retrieve outsourced data from server S: Download: [C: Link_w, k_w; S: DedupTb, c_w] → [C: w; S: ⊥]. Intuitively, to download the ciphertext of w from S, the user first sends the identity link Link_w to the server, which searches the database DB for the ciphertext c_w whose identity link is Link_w and sends it to user C. After receiving c_w, user C runs the decryption algorithm DEC(k_w, c_w) → w to obtain the plaintext w. In doing so, the user first runs the pseudorandom generator to obtain the decryption pad G(k_w) and then computes the plaintext

w = c_w ⊕ G(k_w).

To let the server find a tag t_w similar to a received tag t_w' more quickly and return the auxiliary information P_w = {x_w, s_w} to user C, we also design two methods that improve tag query efficiency: Hamming distance reduction and tag-cutting optimization.

Hamming distance reduction works as follows. The cloud server stores a large amount of data and therefore a large number of tags, so finding a tag t_w similar to t_w' by traversing all tags incurs a large computational overhead. We therefore define the function 1bits(x), which counts the number of 1 bits in x. With similarity threshold t, two similar data x and y must satisfy -t ≤ 1bits(x) - 1bits(y) ≤ t. We compute 1bits(·) for every tag stored on the cloud server and sort the tags in descending order of this value. When searching for a tag t_w similar to t_w', the server only examines tags that satisfy -t ≤ 1bits(t_w') - 1bits(t_w) ≤ t.

For the tags that pass this filter, tag-cutting optimization further speeds up the Hamming distance test. The data are first divided into blocks of nearly equal size, and the Hamming distance is accumulated block by block from front to back. As soon as the accumulated distance exceeds t, the two data cannot be similar, so the remaining blocks need not be examined and the exact Hamming distance is never computed. In practice, two data x and y of length n are split into (x₁, x₂, ..., x_r) and (y₁, y₂, ..., y_r). The first r - (n mod r) strings have length

⌊n/r⌋

and the last n mod r strings have length

⌈n/r⌉.

We compute the Hamming distance dis_Ham(x₁, y₁) of the first block and continue up to dis_Ham(x_r, y_r) of the r-th block. If at block i we have dis_Ham(x₁, y₁) + ... + dis_Ham(x_i, y_i) > t, the two data are not similar and the server stops computing the distances of the later blocks.
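Both query optimizations can be sketched compactly. The example assumes tags are n-bit integers, that blocks are cut starting from the most significant bits, and that the popcount index is kept in ascending order so the standard `bisect` module can be used (the text sorts in descending order, which is equivalent for a range lookup); all function names are hypothetical.

```python
import bisect


def popcount(x: int) -> int:
    """The 1bits(x) function: number of 1 bits in x."""
    return bin(x).count("1")


def split_lengths(n: int, r: int) -> list:
    """r - (n mod r) blocks of floor(n/r) bits, then n mod r blocks of ceil(n/r)."""
    q, m = divmod(n, r)
    return [q] * (r - m) + [q + 1] * m


def candidates(index: list, query: int, t: int) -> list:
    """index: (popcount, tag) pairs sorted ascending; keep |1bits difference| <= t."""
    p = popcount(query)
    lo = bisect.bisect_left(index, (p - t, -1))
    hi = bisect.bisect_right(index, (p + t, float("inf")))
    return [tag for _, tag in index[lo:hi]]


def similar(x: int, y: int, n: int, r: int, t: int) -> bool:
    """True iff dis_Ham(x, y) <= t, stopping at the first block that exceeds t."""
    if abs(popcount(x) - popcount(y)) > t:
        return False  # Hamming distance reduction: cheap necessary condition
    diff, total, shift = x ^ y, 0, n
    for length in split_lengths(n, r):
        shift -= length
        total += popcount((diff >> shift) & ((1 << length) - 1))
        if total > t:
            return False  # tag cutting: later blocks are never examined
    return True
```

The popcount filter is only a necessary condition, since two tags with equal popcounts can still differ in many positions, which is why the blockwise test is still run on every candidate.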

To further improve the security of our scheme, we also design a similar-data message-locked encryption deduplication scheme based on an auxiliary server and one based on similar tags. The auxiliary-server scheme resists offline brute-force attacks by incorporating an RSA-based blind signature. Suppose the server of our system uses the RSA key generation algorithm, which takes the parameter e as input and outputs N and d such that

e·d ≡ 1 (mod φ(N)),

where N is the product of two large primes. ((N, e), (N, d)) is the resulting public/private key pair. Each legitimate user first registers with the key server. Given the key server's public key and the plaintext w, the user selects a random number r, computes fk_w with FKG(r₂, w), and then computes k_w and P_w with RKG(r₃, fk_w). Finally the user computes x ← H(k_w)·r^e mod N and sends x to the key server. On receiving x, the key server computes y ← x^d mod N and returns y to the user. The user computes z ← y·r^-1 mod N and checks whether z^e mod N = H(k_w); if so, the user outputs z, otherwise ⊥. z is then fed to the pseudorandom generator to derive the private encryption key of the plaintext w and the similar-data verification tag t_w = h(G(z)). In the auxiliary-server scheme, the key server learns nothing about the encryption key.

In the similar-tag scheme, the query tag of each data item (for example w) is obtained by TAG(1^λ,) → t_w. More precisely,

[formula missing in source]

where g is a generator of the bilinear group, h is a collision-resistant hash function, and r is a random number. Suppose user C holds data w'. User C first computes fk' ← FKG(r₂, w') and then runs the key regeneration algorithm to compute

[formula missing in source]

Finally, for each record, the user checks whether the cloud server S holds

[formula missing in source]

and the tag

[formula missing in source]

In detail, server S verifies whether

[formula missing in source]

are equal. After the matching tag is found, user C and the server run the proof-of-ownership protocol.
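The blind-signature exchange with the key server can be illustrated with textbook RSA. This is a toy sketch only: the primes 61 and 53 are far too small for real use, SHA-256 reduced modulo N stands in for the hash H, and the blinding is written in the standard form x = H(k_w)·r^e mod N, under which the check z^e mod N = H(k_w) holds. The function names are hypothetical, and `pow(r, -1, N)` requires Python 3.8 or later.

```python
import hashlib

# Toy RSA parameters (assumption: textbook-sized primes, insecure in practice).
P, Q = 61, 53
N, e = P * Q, 17
d = pow(e, -1, (P - 1) * (Q - 1))  # e*d = 1 mod phi(N)


def H(data: bytes) -> int:
    """Assumption: SHA-256 reduced modulo N as the full-domain hash."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def blind(k_w: bytes, r: int) -> int:
    """User side: x = H(k_w) * r^e mod N hides H(k_w) from the key server."""
    return (H(k_w) * pow(r, e, N)) % N


def sign(x: int) -> int:
    """Key server side: y = x^d mod N, computed without learning k_w."""
    return pow(x, d, N)


def unblind(y: int, r: int, k_w: bytes):
    """User side: z = y * r^-1 mod N, accepted only if z^e mod N == H(k_w)."""
    z = (y * pow(r, -1, N)) % N
    return z if pow(z, e, N) == H(k_w) else None


k_w = b"similar key"   # hypothetical key recovered by RKG or REP
r = 7                  # blinding factor, must be coprime to N
z = unblind(sign(blind(k_w, r)), r, k_w)
```

The value z equals H(k_w)^d mod N regardless of the blinding factor, so two users who derive the same k_w obtain the same z, while an attacker cannot test key guesses offline without querying the key server.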

The application effect of the invention is described in detail below in conjunction with experiments.

The system of the invention is implemented in 3000 lines of C++ code on top of the MySQL database system. The free GMP library is used to implement the SimHash algorithm. The cryptographic hash algorithm and the XOR encryption algorithm (SHA-256 and XOR encryption) are implemented with the OpenSSL library. The client and server applications run on two separate computers running Linux 14.04, each with a 1.70 GHz Intel i5-3317U CPU and 4 GB of memory. For the experiments on a local area network, client-server communication was implemented and the two machines were placed in the same location. The bandwidth of the wired connection between server and client was set to 10 Mbps. To measure the performance of the system on real data sets, the invention uses the Amazon movie review text data set, containing 7,911,684 text files of about 1-15 KB each, and an image data set of more than 14 million images.

Table 1. Computation time of different algorithms

The test results for the similar-data locked encryption system are shown in Table 1. In detail: FKG is a similarity hash function, implemented here with 64-bit SimHash and 64-bit PHash (SimHash can only process text files, whereas PHash applies to both text and image files). For fixed-length text data (1 KB), the average computation times of SimHash and PHash are 1386 µs and 5312 µs, respectively. For JPEG image data (10 KB), the average computation time of PHash is 6439 µs. The RKG algorithm takes 261 µs and 885 µs at 64-bit and 256-bit lengths, respectively. The REP algorithm takes 124 µs and 368 µs at 64-bit and 256-bit lengths, respectively. ENC and DEC encrypt and decrypt data with XOR operations; performing ENC and DEC on a 1 KB bit string takes 338 µs. Like FKG, TAG is implemented with a similarity hash function (the 64-bit SimHash algorithm used in the scheme). In addition, the computation times of FKG, ENC, DEC and TAG depend on the input size; the simulation results are given in FIG. 5 and FIG. 6, respectively. Their computation times grow linearly with the size of the input data.

Finally, the similar-data deduplication system is tested with text data and image data. As shown in FIG. 7, one experiment runs on a text database of 100,000 records, each with a 256-bit tag. As shown in FIG. 8, the other runs on an image database of 1000 records, each with a 64-bit tag. In both tests, the client holding the plaintext interacts with the server to check whether a similar ciphertext exists in the database. If the server database contains no similar duplicate, the client uploads its ciphertext. Otherwise, the server performs similar-data deduplication with the user and sends the user a link to the similar data.

The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A locking, encrypting and de-duplicating method for similar data messages in a cloud environment is characterized in that the locking, encrypting and de-duplicating method for similar data messages in the cloud environment adopts a similarity preserving hash function, and realizes the de-duplication of similar data by a secret key extraction method based on an error correcting code and a secure symmetric encryption algorithm based on a pseudo-random generator; the label query efficiency is improved by a Hamming distance reduction and label cutting optimization method;

a client of the similar data message locking encryption deduplication method in the cloud environment applies a similarity preserving hash algorithm to generate a deduplication label of a plaintext and a similar data key; using the similarity-preserving hash, similar plaintext data will be mapped to similar labels and similar data keys having a particular length; the same random encryption key can always be obtained from similar data within a specific Hamming distance, and a first user selects some auxiliary parameters and calculates the random key of a plaintext w'; the auxiliary parameters are stored on the cloud server; when a subsequent user has a label t_w of similar plaintext data w (w ≈ w') and the similar data deduplication operation is required to be executed, the cloud server sends the auxiliary parameters to the subsequent user, and the subsequent user generates the key k_w by running a key regeneration algorithm; if the Hamming distance of file w and file w' is less than a specified value, for example (f_w, f_w') < t, the key regeneration algorithm outputs the same random key k_w' = k_w;

the similar message locking encryption scheme of the similar data message locking encryption deduplication method in the cloud environment is formed by six polynomial time algorithms (FKG, RKG, REP, ENC, DEC, TAG):

FKG(1^λ, r₂, w) → fk_w: is a similar key generation algorithm based on a similarity preserving hash function and is used for enabling a user to calculate summary information of data; it takes a security parameter λ, a random number r₂ ∈ {0,1}^λ and a file w as input, and outputs the similar digest fk_w of the file;

RKG(1^λ, r₃, fk_w) → {k_w, P_w}: is a key generation algorithm for the user to calculate the encryption key and auxiliary parameters of the data; x is a public parameter; the RKG algorithm uses the sketch algorithm SS{r₃, w} → P_w of the secure sketch and the extraction algorithm Ext(w, x) → {K_w} of the fuzzy extractor to generate the auxiliary parameters P = {x, s} and a random encryption key K_w, wherein r₃ is a random parameter for generating a random code C(r₃) → c, algorithm C(·) is a code generation algorithm, and the code c is used for the SS algorithm in the secure sketch;

REP(fk_w', P_w) → k_w: is a key regeneration algorithm run by the user; taking the auxiliary parameter P_w and the fuzzy digest fk_w' of the file as input, it outputs the private key k_w if and only if fk_w' is similar to fk_w; otherwise, it outputs a random value;

ENC(k_w, w) → c_w: is an encryption algorithm run by the user to encrypt data and obtain the corresponding ciphertext; it takes a file w and a private key k_w as input and returns the ciphertext

c_w = w ⊕ G(k_w),

wherein G(k_w) → {0,1}^|w| is a pseudo-random generator that takes k_w as input and outputs a pseudo-random encryption key G(k_w) of length |w|;

DEC(k_w, c_w) → w: is a decryption algorithm run by the user to calculate the plaintext of the input data; it takes the ciphertext c_w and a private key k_w as input and returns the plaintext

w = c_w ⊕ G(k_w);

TAG(1^λ, r₁, w) → t_w: is a label generation algorithm, realized by using a similarity preserving hash function and run by the user to calculate the digest of input data; it takes a security parameter λ, a random number r₁ and data w as input, and returns the data label t_w.

2. The method for locking encryption and de-duplication of similar data messages in cloud environment according to claim 1, wherein the method for locking encryption and de-duplication of similar data messages in cloud environment comprises the following steps:

the client generates a duplicate removal label of a plaintext by using a similarity retention hash algorithm and sends the duplicate removal label to a cloud server, and the cloud server judges whether similar data are stored on the cloud server;

if the cloud server does not have similar data, the user is required to generate a similar data key and auxiliary information for similar key recovery, and encrypted ciphertext data and the auxiliary information are sent to the cloud server;

if the cloud server has similar data, returning auxiliary information for recovering the similar key to the user, encrypting the data by the user through the recovered similar key, and performing similar data ownership verification by using the obtained ciphertext and the server, wherein if the data is verified, the cloud server allows the user to access the data.

3. The method for locking encryption and de-duplication of similar data messages in cloud environment as claimed in claim 1, wherein the method uses an [n, k, 2t+1]_F error-correcting code C; the idea of the Hamming-distance-based secure sketch is to use the error-correcting code to correct errors in data w; on input w, a codeword c ∈ C is selected uniformly at random, and s = SS(w) = w − c is the shift needed to get from c to w; to calculate Rec(w', s), c' = w' − s is computed and then decoded to obtain c, and w is obtained through w = c + s.