CN111782591A - Method for calculating file similar hash - Google Patents
- Publication number: CN111782591A
- Application number: CN202010575989.4A
- Authority: CN (China)
- Prior art keywords: file, byte, hash, stream, similar
- Prior art date
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
- G06F16/137—Hash-based
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a method for calculating a file similarity hash, comprising the following steps: the file is regarded as a byte stream, each byte is compared with its preceding and following bytes, and the corresponding bit of the file's intermediate hash is set according to the comparison result, yielding a bit stream; the resulting bit stream is assembled into a new byte stream; if the length of the new byte stream exceeds a given value, the new byte stream is treated as a file and the method returns to the byte comparison step; otherwise, the new byte stream is the file similarity hash; finally, the difference between the similarity hashes of two files is calculated to judge how similar the two files are. The method is computationally simple: similar files yield similarity hashes that differ little, while different files yield similarity hashes that differ greatly, so different files can be effectively distinguished. The method is valuable for applications such as fast file retrieval and deduplication in cloud file storage systems.
Description
Technical Field
The invention relates to cloud storage technology, and in particular to a method for calculating a file similarity hash.
Background
With the development of cloud computing technology, the number of files that people share in cloud storage is increasing dramatically. Most of the files stored in cloud storage are identical or similar, so uploading them wastes a large amount of network bandwidth and storing them wastes a large amount of storage space. If the similarity of two files can be judged quickly, duplicate or similar files can be eliminated before shared files are uploaded, saving both bandwidth and storage resources.
Many methods for computing the hash of a file have been proposed, such as the secure hash algorithms SHA-1, SHA-2, and SHA-3. If two files are exactly the same, the hashes these algorithms compute for them are identical; if two files differ only slightly, the computed hashes nevertheless differ greatly, so these hashes cannot be used to judge file similarity.
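The behaviour described above can be observed with a short Python sketch. It is only an illustration: the two input strings are made up for this example, and SHA-256 (a member of the SHA-2 family) from the standard `hashlib` module stands in for the algorithms named in the text.

```python
import hashlib

# Two illustrative inputs that differ in a single byte ("dog" vs "cog").
a = b"The quick brown fox jumps over the lazy dog"
b = b"The quick brown fox jumps over the lazy cog"

digest_a = hashlib.sha256(a).hexdigest()
digest_b = hashlib.sha256(b).hexdigest()

# Count the hex positions at which the two digests happen to agree.
matching = sum(x == y for x, y in zip(digest_a, digest_b))

print(digest_a)
print(digest_b)
print(f"{matching} of {len(digest_a)} hex digits match")
```

The two digests are unrelated even though the inputs are nearly identical, which is why a cryptographic hash can detect exact duplicates but says nothing about near duplicates.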
For this reason, fuzzy hash methods for files, such as SSDEEP, SDBF, and TLSH, have been proposed. If two files are the same or similar, the hashes these algorithms compute for them are the same or similar, but the calculation process is relatively complicated.
An image is a special kind of file. Image perceptual hashing methods have been proposed, such as block-mean hashing, discrete cosine transform hashing, and the method of application 201910526184.8, "a method for computing image-aware hashing". Image perceptual hashes are simple to calculate and can be widely applied.
If a similarity hash of a file can be computed in the same way as an image perceptual hash, image files and general files can be processed in a more uniform manner. The present invention has been developed in response to this practical need.
Disclosure of Invention
The present invention aims to provide a method for calculating a file similarity hash that solves the above problems of the prior art.
The invention discloses a method for calculating a file similarity hash, comprising the following steps: the file is regarded as a byte stream, each byte is compared with its preceding and following bytes, and the corresponding bit of the file's intermediate hash is set according to the comparison result, yielding a bit stream; the resulting bit stream is assembled into a new byte stream; if the length of the new byte stream exceeds a given value, the new byte stream is treated as a file and the method returns to the byte comparison step; otherwise, the new byte stream is the file similarity hash; finally, the difference between the similarity hashes of two files is calculated to judge how similar the two files are.
According to an embodiment of the method for calculating a file similarity hash of the present invention, if a file has $n$ bytes from beginning to end, forming the byte stream $B_1, B_2, \ldots, B_n$, it is converted into an $n$-bit bit stream $b_1, b_2, \ldots, b_n$. The conversion uses either of the following methods. The cyclic sequential difference method: if $B_i \ge B_{i+1}$ then $b_i = 1$, otherwise $b_i = 0$, for $i = 1, 2, \ldots, n$, where $i+1$ is taken to be $1$ when $i = n$. The common difference method: if $B_i + B_i \ge B_{i-1} + B_{i+1}$ then $b_i = 1$, otherwise $b_i = 0$, where $i-1$ is taken to be $n$ when $i = 1$, and $i+1$ is taken to be $1$ when $i = n$.
According to an embodiment of the method for calculating a file similarity hash of the present invention, the obtained bit stream is $b_1, b_2, \ldots, b_n$. If the number of bits $n$ is not a multiple of 8, bits of value 0 are appended after $b_n$ until the number of bits $n$ is a multiple of 8. Every 8 consecutive bits are assembled into one byte, so that the bit stream forms, in order, a new byte stream $B'_1, B'_2, \ldots, B'_m$, where $m = n \div 8$ and $B'_i = b_{i \cdot 8}\, b_{i \cdot 8+1}\, b_{i \cdot 8+2} \cdots b_{i \cdot 8+7}$, $i = 1, 2, \ldots, m$.
According to an embodiment of the method for calculating a file similarity hash of the present invention, if the length of the new byte stream is greater than 256, the new byte stream is treated as a file and the method returns to the byte comparison step; otherwise, the new byte stream is the file similarity hash.
The method is computationally simple: similar files yield similarity hashes that differ little, while different files yield similarity hashes that differ greatly, so different files can be effectively distinguished. The method is valuable for applications such as fast file retrieval and deduplication in cloud file storage systems.
Drawings
FIG. 1 is a flow chart of the method for calculating a file similarity hash according to the present invention.
Detailed Description
In order to make the objects, contents, and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.
The invention provides a method for calculating a file similarity hash, comprising the following steps:
(1) Byte comparison step. The file is regarded as a byte stream, each byte is compared with its preceding and following bytes, and the corresponding bit of the file's intermediate hash is set according to the comparison result, yielding a bit stream.
(2) Similarity hash assembly step. The resulting bit stream is assembled into a new byte stream.
(3) Re-hashing step. If the length of the new byte stream exceeds a given value, the new byte stream is treated as a file and the method returns to the byte comparison step; otherwise, the new byte stream is the file similarity hash.
(4) File similarity hash comparison step. The difference between the similarity hashes of two files is calculated to judge how similar the two files are.
When calculating the file similarity hash, the method converts the file's byte stream into bits using either the cyclic sequential difference or the common difference. Both need only simple integer addition, subtraction, and comparison, so the conversion is simple and fast. At the same time, similar files yield similarity hashes that differ little and different files yield similarity hashes that differ greatly, which can improve the user's deduplication experience in a cloud storage system. The present invention can therefore play an important role in file retrieval and file deduplication.
As shown in FIG. 1, the method includes:
(1) Byte comparison step. The file is regarded as a byte stream, each byte is compared with its preceding and following bytes, and the corresponding bit of the file's intermediate hash is set according to the comparison result, yielding a bit stream.
In a specific implementation, if a file has $n$ bytes from beginning to end, forming the byte stream $B_1, B_2, \ldots, B_n$, it is converted into an $n$-bit bit stream $b_1, b_2, \ldots, b_n$. The conversion process is as follows:
Cyclic sequential difference method: if $B_i \ge B_{i+1}$ then $b_i = 1$, otherwise $b_i = 0$, for $i = 1, 2, \ldots, n$. When $i = n$, $i+1$ is taken to be $1$.
Common difference method: if $B_i + B_i \ge B_{i-1} + B_{i+1}$ then $b_i = 1$, otherwise $b_i = 0$. When $i = 1$, $i-1$ is taken to be $n$. When $i = n$, $i+1$ is taken to be $1$.
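The two comparison rules can be sketched in Python as follows. This is a minimal illustration, not the patent's reference implementation: the function names are hypothetical, bytes are indexed from 0 rather than 1, and the wrap-around at both ends of the stream is handled with modular indexing.

```python
def cyclic_sequential_difference(data: bytes) -> list[int]:
    """Bit i is 1 if byte i >= the next byte; the last byte wraps to the first."""
    n = len(data)
    return [1 if data[i] >= data[(i + 1) % n] else 0 for i in range(n)]


def common_difference(data: bytes) -> list[int]:
    """Bit i is 1 if twice byte i >= previous byte + next byte (both cyclic)."""
    n = len(data)
    # data[i - 1] with i == 0 reads the last byte, which is the cyclic predecessor.
    return [
        1 if data[i] + data[i] >= data[i - 1] + data[(i + 1) % n] else 0
        for i in range(n)
    ]


sample = bytes([10, 20, 15, 15])
print(cyclic_sequential_difference(sample))  # [0, 1, 1, 1]
print(common_difference(sample))             # [0, 1, 0, 1]
```

Both functions produce exactly one bit per input byte, so each pass over the stream reduces the data to one eighth of its size once the bits are packed back into bytes.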
(2) Similarity hash assembly step. The resulting bit stream is assembled into a new byte stream.
In a specific implementation, the bit stream is $b_1, b_2, \ldots, b_n$. If the number of bits $n$ is not a multiple of 8, bits of value 0 are appended after $b_n$ until the number of bits $n$ is a multiple of 8. Every 8 consecutive bits are assembled into one byte, so that the bit stream forms, in order, a new byte stream $B'_1, B'_2, \ldots, B'_m$, where $m = n \div 8$ and $B'_i = b_{i \cdot 8}\, b_{i \cdot 8+1}\, b_{i \cdot 8+2} \cdots b_{i \cdot 8+7}$, $i = 1, 2, \ldots, m$.
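The assembly step might look like the Python sketch below. Two points are assumptions rather than statements from the text: the first bit of each group is placed in the most significant position of the byte, and the groups are taken as consecutive runs of eight bits starting from the first bit (read literally with 1-based numbering, the published indices would start at the eighth bit). The function name `pack_bits` is hypothetical.

```python
def pack_bits(bits: list[int]) -> bytes:
    """Pad with 0 bits to a multiple of 8, then pack each group of 8 into a byte."""
    padded = bits + [0] * (-len(bits) % 8)  # append trailing zeros if needed
    out = bytearray()
    for k in range(0, len(padded), 8):
        byte = 0
        for bit in padded[k:k + 8]:
            byte = (byte << 1) | bit  # assumption: first bit is most significant
        out.append(byte)
    return bytes(out)


print(pack_bits([1, 0, 1]).hex())        # a0  (10100000 after zero padding)
print(pack_bits([1] * 8 + [1]).hex())    # ff80
```

Because the padding bits are always zero, two bit streams that differ only in length by a few trailing zeros pack to the same bytes; the bit order inside a byte does not matter for comparison as long as both files are hashed with the same convention.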
(3) Re-hashing step. If the length of the new byte stream exceeds a given value, the new byte stream is treated as a file and the method returns to the byte comparison step; otherwise, the new byte stream is the file similarity hash.
In specific implementation, if the length of the new byte stream is greater than 256, the new byte stream is regarded as a file, and the byte comparison step is returned; otherwise, the file similarity hash is obtained.
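Steps (1) to (3) can be combined into a single loop, sketched below in Python. The sketch assumes the cyclic sequential difference rule, most-significant-bit-first packing, and a threshold parameter named `max_len` that defaults to the value 256 given above; all function and parameter names are hypothetical.

```python
def compare_bytes(data: bytes) -> list[int]:
    """Cyclic sequential difference: bit i is 1 if byte i >= the next byte."""
    n = len(data)
    return [1 if data[i] >= data[(i + 1) % n] else 0 for i in range(n)]


def pack_bits(bits: list[int]) -> bytes:
    """Zero-pad to a multiple of 8 bits and pack, first bit most significant."""
    padded = bits + [0] * (-len(bits) % 8)
    out = bytearray()
    for k in range(0, len(padded), 8):
        byte = 0
        for bit in padded[k:k + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


def similarity_hash(data: bytes, max_len: int = 256) -> bytes:
    """Repeat compare-and-pack until the stream is no longer than max_len bytes."""
    stream = data
    while True:
        stream = pack_bits(compare_bytes(stream))
        if len(stream) <= max_len:
            return stream


# 16384 bytes shrink to 2048 bytes, then to 256 bytes, which ends the loop.
digest = similarity_hash(bytes(range(256)) * 64)
print(len(digest))
```

Each pass shrinks the stream by a factor of eight (rounded up), so the loop finishes after a number of passes that grows only logarithmically with the file size. Note that under this reading even a file shorter than the threshold goes through one pass, because the length check is applied to the new byte stream.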
(4) File similarity hash comparison step. The difference between the similarity hashes of two files is calculated to judge how similar the two files are.
In a specific implementation, the conventional edit distance can be used to calculate the difference between the similarity hash bit streams of two files; when the edit distance is calculated, inserting 1 bit costs 1 and deleting 1 bit costs 1. The smaller the edit distance between the similarity hashes of two files, the more similar the two files are.
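A possible Python sketch of this comparison follows. The text gives costs only for insertion and deletion, so the sketch assumes that a differing bit is handled as one deletion plus one insertion (cost 2) rather than as a separate substitution operation; that choice, the most-significant-bit-first unpacking, and the function names are assumptions made for illustration.

```python
def to_bits(digest: bytes) -> list[int]:
    """Unpack bytes into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in digest for shift in range(7, -1, -1)]


def indel_distance(a: list[int], b: list[int]) -> int:
    """Edit distance with insert cost 1 and delete cost 1 (no substitution)."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]                   # deleting the first i bits of a
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1])                   # bits match, no cost
            else:
                cur.append(1 + min(prev[j], cur[j - 1]))  # delete or insert
        prev = cur
    return prev[-1]


print(indel_distance([1, 0, 1], [1, 0, 1]))               # 0
print(indel_distance([1, 0, 1], [1, 1]))                  # 1
print(indel_distance(to_bits(b"\xa0"), to_bits(b"\xa1")))  # 2
```

The dynamic programme keeps only two rows, so comparing two hashes of at most 256 bytes (2048 bits each) needs a few million cell updates and a few kilobytes of memory.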
In summary, the disclosed method for calculating a file similarity hash comprises four steps. (1) Byte comparison step: the file is regarded as a byte stream, each byte is compared with its preceding and following bytes, and the corresponding bit of the file's intermediate hash is set according to the comparison result, yielding a bit stream. (2) Similarity hash assembly step: the resulting bit stream is assembled into a new byte stream. (3) Re-hashing step: if the length of the new byte stream exceeds a given value, the new byte stream is treated as a file and the method returns to the byte comparison step; otherwise, the new byte stream is the file similarity hash. (4) File similarity hash comparison step: the difference between the similarity hashes of two files is calculated to judge how similar the two files are.
The method is computationally simple: similar files yield similarity hashes that differ little, while different files yield similarity hashes that differ greatly, so different files can be effectively distinguished. The method is valuable for applications such as fast file retrieval and deduplication in cloud file storage systems, and it meets the need of a cloud storage system's deduplication function to judge the similarity of two files simply and quickly.
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.
Claims (4)
1. A method for calculating file similarity hash, comprising:
regarding the file as a byte stream, comparing each byte with its preceding and following bytes, and setting the corresponding bit of the file's intermediate hash according to the comparison result to obtain a bit stream;
assembling the resulting bit stream into a new byte stream;
if the length of the new byte stream exceeds a given value, treating the new byte stream as a file and returning to the byte comparison step; otherwise, obtaining the file similarity hash;
and calculating the difference between the similarity hashes of two files to judge how similar the two files are.
2. The method for calculating a file similarity hash as claimed in claim 1, wherein if a file has $n$ bytes from beginning to end, forming the byte stream $B_1, B_2, \ldots, B_n$, it is converted into an $n$-bit bit stream $b_1, b_2, \ldots, b_n$, the conversion process comprising:
the cyclic sequential difference method: if $B_i \ge B_{i+1}$ then $b_i = 1$, otherwise $b_i = 0$, for $i = 1, 2, \ldots, n$, where $i+1$ is taken to be $1$ when $i = n$;
the common difference method: if $B_i + B_i \ge B_{i-1} + B_{i+1}$ then $b_i = 1$, otherwise $b_i = 0$, where $i-1$ is taken to be $n$ when $i = 1$, and $i+1$ is taken to be $1$ when $i = n$.
3. The method of claim 2, wherein the obtained bit stream is $b_1, b_2, \ldots, b_n$; if the number of bits $n$ is not a multiple of 8, bits of value 0 are appended after $b_n$ until the number of bits $n$ is a multiple of 8; every 8 consecutive bits are assembled into one byte, so that the bit stream forms, in order, a new byte stream $B'_1, B'_2, \ldots, B'_m$, where $m = n \div 8$ and $B'_i = b_{i \cdot 8}\, b_{i \cdot 8+1}\, b_{i \cdot 8+2} \cdots b_{i \cdot 8+7}$, $i = 1, 2, \ldots, m$.
4. The method for calculating a file similarity hash as claimed in claim 1, wherein if the length of the new byte stream is greater than 256, the new byte stream is treated as a file and the method returns to the byte comparison step; otherwise, the file similarity hash is obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010575989.4A CN111782591B (en) | 2020-06-22 | 2020-06-22 | Method for calculating file similarity hash |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010575989.4A CN111782591B (en) | 2020-06-22 | 2020-06-22 | Method for calculating file similarity hash |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111782591A (en) | 2020-10-16 |
CN111782591B (en) | 2023-05-16 |
Family
ID=72756136
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010575989.4A (Active) CN111782591B (en) | 2020-06-22 | 2020-06-22 | Method for calculating file similarity hash |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111782591B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5832520A (en) * | 1996-07-03 | 1998-11-03 | Miller, Call, Plauck And Miller | Automatic file differencing and updating system |
US20030212712A1 (en) * | 2002-05-13 | 2003-11-13 | Jinsheng Gu | Byte-level file differencing and updating algorithms |
US20090307251A1 (en) * | 2008-06-06 | 2009-12-10 | Steve Heller | Method for reducing redundancy between two or more datasets |
CN105868305A (en) * | 2016-03-25 | 2016-08-17 | 西安电子科技大学 | A fuzzy matching-supporting cloud storage data dereplication method |
CN108595975A (en) * | 2018-05-07 | 2018-09-28 | 南京信息工程大学 | A kind of carrier-free information concealing method based on the retrieval of nearly multiimage |
CN110414528A (en) * | 2019-06-18 | 2019-11-05 | 北京计算机技术及应用研究所 | A method of calculating image perception Hash |
Also Published As
Publication number | Publication date |
---|---|
CN111782591B (en) | 2023-05-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7733910B2 (en) | Data segmentation using shift-varying predicate function fingerprinting | |
JP4263477B2 (en) | System for identifying common digital sequences | |
EP2256934B1 (en) | Method and apparatus for content-aware and adaptive deduplication | |
US7478113B1 (en) | Boundaries | |
US20170286443A1 (en) | Optimizing data block size for deduplication | |
US11163734B2 (en) | Data processing method and system and client | |
US20120089579A1 (en) | Compression pipeline for storing data in a storage cloud | |
US11221992B2 (en) | Storing data files in a file system | |
US20160314141A1 (en) | Compression-based filtering for deduplication | |
US10419557B2 (en) | Identifying and managing redundant digital content transfers | |
CN111782591A (en) | Method for calculating file similar hash | |
CN112182112A (en) | Block chain based distributed data dynamic storage method and electronic equipment | |
TWI420333B (en) | A distributed de-duplication system and the method therefore | |
US11347424B1 (en) | Offset segmentation for improved inline data deduplication | |
US11343272B2 (en) | Proof of work based on compressed video | |
CN110222043B (en) | Data monitoring method, device and equipment of cloud storage server | |
CN110968575B (en) | Deduplication method of big data processing system | |
CN117097717B (en) | File transmission optimization method and system for simulation result and electronic equipment | |
CN114625316A (en) | Content-based blocking method, system and medium applied to data de-duplication | |
Kim et al. | Enhanced archive format using data chunking scheme | |
CN117156200A (en) | Method, system, electronic equipment and medium for removing duplication of massive videos | |
CN115454714A (en) | File storage backup method, device, equipment and product | |
CN112988366A (en) | Parameter server, master client, and weight parameter processing method and system | |
Chapuis et al. | Knowledgeable chunking | |
CN117370617A (en) | Large-scale redundant data compression method based on xiao Ha-th |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||