CN111782591A - Method for calculating file similar hash - Google Patents

Publication number
CN111782591A
Authority
CN
China
Prior art keywords
file
byte
hash
stream
similar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010575989.4A
Other languages
Chinese (zh)
Other versions
CN111782591B (en)
Inventor
蒋遂平
姜涛
王颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Computer Technology and Applications
Original Assignee
Beijing Institute of Computer Technology and Applications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Computer Technology and Applications
Priority to CN202010575989.4A
Publication of CN111782591A
Application granted
Publication of CN111782591B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/13: File access structures, e.g. distributed indices
    • G06F16/137: Hash-based
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for calculating a file similar hash, comprising the following steps: the file is regarded as a byte stream, the value of each byte is compared with the values of its preceding and following bytes, and a bit of the file's intermediate hash is set according to the comparison result, yielding a bit stream; the resulting bit stream is assembled into a new byte stream; if the length of the new byte stream is greater than a given value, the new byte stream is regarded as a file and the method returns to the byte comparison step; otherwise, the new byte stream is the file similar hash; the difference between the similar hashes of two files is then calculated to judge how similar the two files are. The method is computationally simple: the similar hashes of similar files differ little, while the similar hashes of different files differ greatly, so different files can be effectively distinguished. The method has significant application value in applications such as fast file retrieval and deduplication in cloud file storage systems.

Description

Method for calculating file similar hash
Technical Field
The invention relates to cloud storage technology, and in particular to a method for calculating a file similar hash.
Background
With the development of cloud computing technology, the number of files that people share in cloud storage is increasing dramatically. Most of the files stored in cloud storage are identical or similar to one another, which wastes a large amount of the network bandwidth used for uploading and a large amount of storage space. If the similarity of two files can be judged quickly, duplicate or similar files can be eliminated before shared files are uploaded, saving bandwidth and storage resources.
Many methods for computing the hash of a file have been proposed, such as the secure hash algorithms SHA1, SHA2, and SHA3. If two files are completely identical, the hashes calculated by these algorithms are identical; but if two files differ only slightly, the hashes calculated by these algorithms differ greatly, so these hashes cannot be used to judge file similarity.
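The behaviour described above can be observed directly. The following is a minimal sketch, assuming SHA-256 (a member of the SHA2 family) from Python's standard `hashlib`; the two input strings and the helper name `differing_bits` are illustrative choices, not part of the patent.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Two inputs that differ in a single character (illustrative data).
d1 = hashlib.sha256(b"hello world").digest()
d2 = hashlib.sha256(b"hello worle").digest()

# The digests are unrelated even though the inputs are nearly identical,
# so the number of differing bits says nothing about input similarity.
print(differing_bits(d1, d2), "of", len(d1) * 8, "bits differ")
```

For a cryptographic hash, roughly half of the output bits are expected to change for any change in the input, which is why such hashes are suitable for detecting exact duplicates but not similar files.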
For this reason, fuzzy hash methods for files, such as SSDEEP, SDBF, and TLSH, have been proposed. If two files are the same or similar, the hashes calculated by these algorithms are the same or similar, but the calculation process is relatively complicated.
An image is a special kind of file. Image perceptual hashing methods have been proposed, such as block-mean hashing, discrete cosine transform hashing, and the method of application 201910526184.8, "a method for computing image perceptual hash". Image perceptual hashes are simple to calculate and can be widely applied.
If a similar hash of a file could be computed in the same way as an image perceptual hash, image files and general files could be processed in a more uniform manner. The present invention has been developed in response to this practical need.
Disclosure of Invention
The present invention aims to provide a method for calculating a file similar hash that addresses the above problems of the prior art.
The method for calculating a file similar hash according to the invention comprises the following steps: the file is regarded as a byte stream, the value of each byte is compared with the values of its preceding and following bytes, and a bit of the file's intermediate hash is set according to the comparison result, yielding a bit stream; the resulting bit stream is assembled into a new byte stream; if the length of the new byte stream is greater than a given value, the new byte stream is regarded as a file and the method returns to the byte comparison step; otherwise, the new byte stream is the file similar hash; and the difference between the similar hashes of two files is calculated to judge how similar the two files are.
According to an embodiment of the method for calculating a file similar hash of the present invention, if a file has $n$ bytes from beginning to end, forming the byte stream $B_1, B_2, \ldots, B_n$, it is converted into a bit stream of $n$ bits $b_1, b_2, \ldots, b_n$ by one of the following methods. Cyclic sequential difference method: if $B_i \ge B_{i+1}$ then $b_i = 1$, otherwise $b_i = 0$, for $i = 1, 2, \ldots, n$; when $i = n$, the index $i+1$ is taken to be 1. Common difference method: if $B_i + B_i \ge B_{i-1} + B_{i+1}$ then $b_i = 1$, otherwise $b_i = 0$; when $i = 1$, the index $i-1$ is taken to be $n$, and when $i = n$, the index $i+1$ is taken to be 1.
According to an embodiment of the method for calculating a file similar hash of the present invention, let the obtained bit stream be $b_1, b_2, \ldots, b_n$. If the number of bits $n$ is not a multiple of 8, bits of value 0 are appended after $b_n$ so that the number of bits $n$ becomes a multiple of 8. Every 8 consecutive bits are assembled into one byte, so that the bit stream, taken in order, forms a new byte stream $B'_1, B'_2, \ldots, B'_m$, where $m = n \div 8$ and $B'_i = b_{8i}\, b_{8i+1}\, b_{8i+2} \cdots b_{8i+7}$, $i = 1, 2, \ldots, m$.
According to an embodiment of the method for calculating a file similar hash of the present invention, if the length of the new byte stream is greater than 256, the new byte stream is regarded as a file and the method returns to the byte comparison step; otherwise, the new byte stream is the file similar hash.
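As a worked illustration of how quickly the byte stream shrinks, consider a file of 1 MiB. The file size is an assumed example, not a figure from the patent; $n_k$ denotes the byte-stream length after $k$ passes, and each pass turns $n_k$ bytes into $n_k$ bits, that is, $\lceil n_k / 8 \rceil$ bytes after zero padding.

```latex
n_0 = 1\,048\,576, \qquad n_{k+1} = \left\lceil \frac{n_k}{8} \right\rceil
\;\Longrightarrow\;
n_1 = 131\,072,\quad n_2 = 16\,384,\quad n_3 = 2\,048,\quad n_4 = 256 .
```

Since $n_4 = 256$ is not greater than 256, the iteration stops after four passes, and under this assumption the file similar hash is 256 bytes long.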
The method for calculating a file similar hash is computationally simple: the similar hashes of similar files differ little, while the similar hashes of different files differ greatly, so different files can be effectively distinguished. The method has significant application value in applications such as fast file retrieval and deduplication in cloud file storage systems.
Drawings
FIG. 1 is a flow chart of the method for calculating a file similar hash according to the present invention.
Detailed Description
In order to make the objects, contents, and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.
The invention provides a method for calculating a file similar hash, which comprises the following steps:
(1) Byte comparison step. The file is regarded as a byte stream, the value of each byte is compared with the values of its preceding and following bytes, and a bit of the file's intermediate hash is set according to the comparison result, yielding a bit stream.
(2) Similar hash assembly step. The resulting bit stream is assembled into a new byte stream.
(3) Re-hashing step. If the length of the new byte stream is greater than a given value, the new byte stream is regarded as a file and the method returns to the byte comparison step; otherwise, the new byte stream is the file similar hash.
(4) File similar hash comparison step. The difference between the similar hashes of two files is calculated to judge how similar the two files are.
When calculating the file similar hash, the method converts the file's byte stream into bits using either the cyclic sequential difference or the common difference. This requires only simple integer addition and subtraction, so it is simple and fast. At the same time, the similar hashes of similar files differ little and the similar hashes of different files differ greatly, which can improve the user's experience of deduplication in a cloud storage system. The present invention can therefore play an important role in file retrieval and file deduplication.
FIG. 1 is a flow chart of the method for calculating a file similar hash according to the present invention. As shown in FIG. 1, the method includes:
(1) Byte comparison step. The file is regarded as a byte stream, the value of each byte is compared with the values of its preceding and following bytes, and a bit of the file's intermediate hash is set according to the comparison result, yielding a bit stream.
In a specific implementation, if a file has $n$ bytes from beginning to end, forming the byte stream $B_1, B_2, \ldots, B_n$, it is converted into a bit stream of $n$ bits $b_1, b_2, \ldots, b_n$. The conversion uses one of the following methods:
Cyclic sequential difference method: if $B_i \ge B_{i+1}$ then $b_i = 1$, otherwise $b_i = 0$, for $i = 1, 2, \ldots, n$. When $i = n$, the index $i+1$ is taken to be 1.
Common difference method: if $B_i + B_i \ge B_{i-1} + B_{i+1}$ then $b_i = 1$, otherwise $b_i = 0$. When $i = 1$, the index $i-1$ is taken to be $n$. When $i = n$, the index $i+1$ is taken to be 1.
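The two conversion rules can be sketched in a few lines of Python. This is an illustrative sketch rather than the patentee's implementation: the function names are hypothetical, and the code uses 0-based indices, so the byte written as B with index i+1 in the text is `data[(i + 1) % n]` and the one with index i-1 is `data[i - 1]` (Python's negative index provides the wrap-around for the first byte).

```python
from typing import List


def cyclic_sequential_difference(data: bytes) -> List[int]:
    """Bit is 1 if a byte is >= its successor; the successor of the last byte is the first."""
    n = len(data)
    return [1 if data[i] >= data[(i + 1) % n] else 0 for i in range(n)]


def common_difference(data: bytes) -> List[int]:
    """Bit is 1 if twice a byte is >= the sum of its two neighbours, wrapping at both ends."""
    n = len(data)
    return [
        1 if data[i] + data[i] >= data[i - 1] + data[(i + 1) % n] else 0
        for i in range(n)
    ]


sample = bytes([3, 1, 2])
print(cyclic_sequential_difference(sample))  # [1, 0, 0]
print(common_difference(sample))             # [1, 0, 1]
```

For the three-byte sample, the cyclic rule compares 3 with 1, 1 with 2, and 2 with the first byte 3; the common difference rule compares 6 with 2+1, 2 with 3+2, and 4 with 1+3.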
(2) Similar hash assembly step. The resulting bit stream is assembled into a new byte stream.
In a specific implementation, let the bit stream be $b_1, b_2, \ldots, b_n$. If the number of bits $n$ is not a multiple of 8, bits of value 0 are appended after $b_n$ so that the number of bits $n$ becomes a multiple of 8. Every 8 consecutive bits are assembled into one byte, so that the bit stream, taken in order, forms a new byte stream $B'_1, B'_2, \ldots, B'_m$, where $m = n \div 8$ and $B'_i = b_{8i}\, b_{8i+1}\, b_{8i+2} \cdots b_{8i+7}$, $i = 1, 2, \ldots, m$.
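A minimal sketch of the assembly step follows. The function name `pack_bits` is hypothetical, and two points are assumptions rather than statements from the patent: the grouping starts at the first bit of the stream (the reading suggested by "every 8 consecutive bits"), and the first bit of each group becomes the most significant bit of the byte, since the text does not specify a bit order within a byte.

```python
from typing import List


def pack_bits(bits: List[int]) -> bytes:
    """Pad the bit list with zeros to a multiple of 8, then pack each group of 8 into a byte."""
    padded = bits + [0] * (-len(bits) % 8)  # -len % 8 is the number of zero bits still needed
    out = bytearray()
    for start in range(0, len(padded), 8):
        value = 0
        for bit in padded[start:start + 8]:
            value = (value << 1) | bit  # assumption: earlier bits are more significant
        out.append(value)
    return bytes(out)


print(pack_bits([1, 0, 1]).hex())  # a0
```

Three input bits are padded to `10100000`, which is the single byte 0xA0 under the assumed bit order.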
(3) Re-hashing step. If the length of the new byte stream is greater than a given value, the new byte stream is regarded as a file and the method returns to the byte comparison step; otherwise, the new byte stream is the file similar hash.
In a specific implementation, if the length of the new byte stream is greater than 256, the new byte stream is regarded as a file and the method returns to the byte comparison step; otherwise, the new byte stream is the file similar hash.
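Steps (1) to (3) can be combined into one loop, sketched below under stated assumptions. The names `to_bits`, `pack_bits`, `similar_hash`, and `max_len` are hypothetical; the sketch picks the cyclic sequential difference rule (the common difference rule could be substituted), and it assumes the first bit of each group of 8 is the most significant bit of the assembled byte.

```python
from typing import List


def to_bits(data: bytes) -> List[int]:
    """Cyclic sequential difference: 1 if a byte is >= its (wrapping) successor."""
    n = len(data)
    return [1 if data[i] >= data[(i + 1) % n] else 0 for i in range(n)]


def pack_bits(bits: List[int]) -> bytes:
    """Zero-pad to a multiple of 8 bits and pack, first bit as most significant (assumed)."""
    padded = bits + [0] * (-len(bits) % 8)
    out = bytearray()
    for start in range(0, len(padded), 8):
        value = 0
        for bit in padded[start:start + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)


def similar_hash(data: bytes, max_len: int = 256) -> bytes:
    """Repeat compare-and-assemble until the byte stream is no longer than max_len bytes."""
    stream = data
    while True:
        stream = pack_bits(to_bits(stream))  # one pass shrinks the stream about 8-fold
        if len(stream) <= max_len:
            return stream


print(len(similar_hash(bytes(range(256)) * 16)))  # 64
```

In the usage line, a 4096-byte input becomes 512 bytes after the first pass, which is still greater than 256, and 64 bytes after the second pass, at which point the loop returns. At least one pass is always performed, following the order of the steps in the text.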
(4) File similar hash comparison step. The difference between the similar hashes of two files is calculated to judge how similar the two files are.
In a specific implementation, the conventional edit distance method can be used to calculate the difference between the similar hash bit streams of the two files; when calculating the edit distance, inserting 1 bit costs 1 and deleting 1 bit costs 1. The smaller the edit distance between the similar hashes of two files, the more similar the two files are.
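The comparison step can be sketched with a standard dynamic-programming edit distance. The text gives costs only for insertion and deletion, so this sketch assumes those are the only operations (a changed bit then costs 2: one deletion plus one insertion); that reading, the function names, and the most-significant-bit-first unpacking are assumptions, not details confirmed by the patent.

```python
from typing import List


def hash_bits(h: bytes) -> List[int]:
    """Unpack a hash into its bits, most significant bit of each byte first (assumed order)."""
    return [(byte >> shift) & 1 for byte in h for shift in range(7, -1, -1)]


def indel_distance(a: List[int], b: List[int]) -> int:
    """Edit distance where inserting a bit costs 1 and deleting a bit costs 1."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a to each prefix of b
    for i, x in enumerate(a, start=1):
        curr = [i]  # deleting the first i bits of a reaches the empty prefix of b
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1])  # matching bits cost nothing
            else:
                curr.append(min(prev[j] + 1, curr[j - 1] + 1))  # delete x or insert y
        prev = curr
    return prev[-1]


print(indel_distance([1, 0, 1], [1, 1, 1]))  # 2
```

This runs in time proportional to the product of the two bit-stream lengths, which is modest here because a similar hash is at most 256 bytes (2048 bits) under the threshold given above.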
The method for calculating a file similar hash is computationally simple: the similar hashes of similar files differ little, while the similar hashes of different files differ greatly, so different files can be effectively distinguished. The method has significant application value in applications such as fast file retrieval and deduplication in cloud file storage systems, and it meets the need of a cloud storage system's deduplication function to judge the similarity of two files simply and quickly.
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims (4)

1. A method for calculating a file similar hash, comprising:
regarding the file as a byte stream, comparing the value of each byte with the values of its preceding and following bytes, and setting a bit of the file's intermediate hash according to the comparison result to obtain a bit stream;
assembling the resulting bit stream into a new byte stream;
if the length of the new byte stream is greater than a given value, regarding the new byte stream as a file and returning to the byte comparison step; otherwise, taking the new byte stream as the file similar hash;
and calculating the difference between the similar hashes of two files to judge how similar the two files are.
2. The method for calculating a file similar hash as claimed in claim 1, wherein if a file has $n$ bytes from beginning to end, forming the byte stream $B_1, B_2, \ldots, B_n$, it is converted into a bit stream of $n$ bits $b_1, b_2, \ldots, b_n$, the conversion process comprising:
a cyclic sequential difference method: if $B_i \ge B_{i+1}$ then $b_i = 1$, otherwise $b_i = 0$, for $i = 1, 2, \ldots, n$; when $i = n$, the index $i+1$ is taken to be 1;
a common difference method: if $B_i + B_i \ge B_{i-1} + B_{i+1}$ then $b_i = 1$, otherwise $b_i = 0$; when $i = 1$, the index $i-1$ is taken to be $n$, and when $i = n$, the index $i+1$ is taken to be 1.
3. The method of claim 2, wherein the obtained bit stream is $b_1, b_2, \ldots, b_n$; if the number of bits $n$ is not a multiple of 8, bits of value 0 are appended after $b_n$ so that the number of bits $n$ becomes a multiple of 8; every 8 consecutive bits are assembled into one byte, so that the bit stream, taken in order, forms a new byte stream $B'_1, B'_2, \ldots, B'_m$, where $m = n \div 8$ and $B'_i = b_{8i}\, b_{8i+1}\, b_{8i+2} \cdots b_{8i+7}$, $i = 1, 2, \ldots, m$.
4. The method for calculating a file similar hash as claimed in claim 1, wherein if the length of the new byte stream is greater than 256, the new byte stream is regarded as a file and the method returns to the byte comparison step; otherwise, the new byte stream is the file similar hash.
CN202010575989.4A 2020-06-22 2020-06-22 Method for calculating file similarity hash Active CN111782591B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010575989.4A CN111782591B (en) 2020-06-22 2020-06-22 Method for calculating file similarity hash

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010575989.4A CN111782591B (en) 2020-06-22 2020-06-22 Method for calculating file similarity hash

Publications (2)

Publication Number Publication Date
CN111782591A (en) 2020-10-16
CN111782591B (en) 2023-05-16

Family

ID=72756136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010575989.4A Active CN111782591B (en) 2020-06-22 2020-06-22 Method for calculating file similarity hash

Country Status (1)

Country Link
CN (1) CN111782591B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5832520A (en) * 1996-07-03 1998-11-03 Miller, Call, Plauck And Miller Automatic file differencing and updating system
US20030212712A1 (en) * 2002-05-13 2003-11-13 Jinsheng Gu Byte-level file differencing and updating algorithms
US20090307251A1 (en) * 2008-06-06 2009-12-10 Steve Heller Method for reducing redundancy between two or more datasets
CN105868305A (en) * 2016-03-25 2016-08-17 西安电子科技大学 A fuzzy matching-supporting cloud storage data dereplication method
CN108595975A (en) * 2018-05-07 2018-09-28 南京信息工程大学 A kind of carrier-free information concealing method based on the retrieval of nearly multiimage
CN110414528A (en) * 2019-06-18 2019-11-05 北京计算机技术及应用研究所 A method of calculating image perception Hash

Also Published As

Publication number Publication date
CN111782591B (en) 2023-05-16

Similar Documents

Publication Publication Date Title
US7733910B2 (en) Data segmentation using shift-varying predicate function fingerprinting
JP4263477B2 (en) System for identifying common digital sequences
EP2256934B1 (en) Method and apparatus for content-aware and adaptive deduplication
US7478113B1 (en) Boundaries
US20170286443A1 (en) Optimizing data block size for deduplication
US11163734B2 (en) Data processing method and system and client
US20120089579A1 (en) Compression pipeline for storing data in a storage cloud
US11221992B2 (en) Storing data files in a file system
US20160314141A1 (en) Compression-based filtering for deduplication
US10419557B2 (en) Identifying and managing redundant digital content transfers
CN111782591A (en) Method for calculating file similar hash
CN112182112A (en) Block chain based distributed data dynamic storage method and electronic equipment
TWI420333B (en) A distributed de-duplication system and the method therefore
US11347424B1 (en) Offset segmentation for improved inline data deduplication
US11343272B2 (en) Proof of work based on compressed video
CN110222043B (en) Data monitoring method, device and equipment of cloud storage server
CN110968575B (en) Deduplication method of big data processing system
CN117097717B (en) File transmission optimization method and system for simulation result and electronic equipment
CN114625316A (en) Content-based blocking method, system and medium applied to data de-duplication
Kim et al. Enhanced archive format using data chunking scheme
CN117156200A (en) Method, system, electronic equipment and medium for removing duplication of massive videos
CN115454714A (en) File storage backup method, device, equipment and product
CN112988366A (en) Parameter server, master client, and weight parameter processing method and system
Chapuis et al. Knowledgeable chunking
CN117370617A (en) Large-scale redundant data compression method based on xiao Ha-th

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant