CN111782591A - Method for calculating a file similarity hash

Info

Publication number: CN111782591A
Application number: CN202010575989.4A
Authority: CN
Inventors: 蒋遂平; 姜涛; 王颖
Original assignee: Beijing Institute of Computer Technology and Applications
Current assignee: Beijing Institute of Computer Technology and Applications
Priority date: 2020-06-22
Filing date: 2020-06-22
Publication date: 2020-10-16
Anticipated expiration: 2040-06-22
Also published as: CN111782591B

Abstract

The invention relates to a method for calculating a file similarity hash, comprising the following steps: the file is regarded as a byte stream, each byte of the file is compared in size with its preceding and following bytes, and a bit of the file's intermediate hash is set according to the comparison result, yielding a bit stream; the resulting bit stream is assembled into a new byte stream; if the length of the new byte stream is greater than a given value, the new byte stream is regarded as a file and the method returns to the byte comparison step; otherwise, the new byte stream is the file similarity hash; the difference between the similarity hashes of two files is calculated to judge how similar the two files are. The method is simple to compute. The similarity hashes of similar files differ little and those of different files differ greatly, so different files can be effectively distinguished. The method has important application value in fast file retrieval and deduplication in file cloud storage systems.

Description

Method for calculating a file similarity hash

Technical Field

The invention relates to cloud storage technology, and in particular to a method for calculating a file similarity hash.

Background

With the development of cloud computing technology, the number of files that people share in cloud storage is increasing dramatically. Most files stored in cloud storage are identical or similar to one another, which wastes a large amount of the network bandwidth used for uploading as well as a large amount of storage space. If the similarity of two files can be judged quickly, duplicate or similar files can be eliminated before shared files are uploaded, saving bandwidth and storage resources.

Many methods for computing the hash of a file have been proposed, such as the secure hash algorithms SHA1, SHA2, and SHA3. If two files are identical, the hashes these algorithms compute for them are identical; if two files differ only slightly, however, the computed hash values differ greatly, so these hashes cannot be used to judge file similarity.
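The behaviour described above can be observed with a short sketch. It uses SHA-256 (a member of the SHA2 family) from Python's standard `hashlib` module; the sample strings and the helper name `differing_bits` are illustrative assumptions, not part of the invention.

```python
import hashlib

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Two inputs that differ in a single byte (illustrative sample data).
original = b"The quick brown fox jumps over the lazy dog"
modified = b"The quick brown fox jumps over the lazy cog"

d1 = hashlib.sha256(original).digest()
d2 = hashlib.sha256(modified).digest()

print(d1.hex())
print(d2.hex())
print(differing_bits(d1, d2), "of", len(d1) * 8, "bits differ")
```

Although the inputs are nearly identical, the two digests are unrelated, which is why a cryptographic hash can detect exact duplicates but gives no measure of similarity.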

For this reason, fuzzy hashing methods for files, such as SSDEEP, SDBF, and TLSH, have been proposed. If two files are the same or similar, the hashes these algorithms compute for them are the same or similar, but the calculation process is relatively complicated.

An image is a special kind of file. Image perceptual hashing methods have been proposed, such as block-mean hashing, discrete cosine transform hashing, and application 201910526184.8, "a method for computing an image perceptual hash". Image perceptual hashes are simple to compute and can be widely applied.

If a file similarity hash can be computed in the same way as an image perceptual hash, image files and general files can be processed in a more uniform manner. The present invention was developed in response to this practical need.

Disclosure of Invention

The present invention aims to provide a method for calculating a file similarity hash that solves the above problems of the prior art.

The invention discloses a method for calculating a file similarity hash, comprising the following steps: the file is regarded as a byte stream, each byte of the file is compared in size with its preceding and following bytes, and a bit of the file's intermediate hash is set according to the comparison result, yielding a bit stream; the resulting bit stream is assembled into a new byte stream; if the length of the new byte stream is greater than a given value, the new byte stream is regarded as a file and the method returns to the byte comparison step; otherwise, the new byte stream is the file similarity hash; the difference between the similarity hashes of two files is calculated to judge how similar the two files are.

According to an embodiment of the method for calculating a file similarity hash of the present invention, if a file has n bytes from beginning to end, forming the byte stream B_1, B_2, …, B_n, it is converted into a bit stream b_1, b_2, …, b_n of n bits. The conversion uses one of the following methods. Cyclic sequential difference method: if B_i ≥ B_{i+1}, then b_i = 1, otherwise b_i = 0, for i = 1, 2, …, n; when i = n, i+1 is taken to be 1. Common difference method: if B_i + B_i ≥ B_{i-1} + B_{i+1}, then b_i = 1, otherwise b_i = 0, for i = 1, 2, …, n; when i = 1, i-1 is taken to be n, and when i = n, i+1 is taken to be 1.

According to an embodiment of the method for calculating a file similarity hash of the present invention, given the resulting bit stream b_1, b_2, …, b_n, if the number of bits n is not a multiple of 8, bits of value 0 are appended after b_n until the number of bits n is a multiple of 8. Every 8 consecutive bits are assembled into one byte, so that the bit stream forms, in order, a new byte stream B'_1, B'_2, …, B'_m, where m = n ÷ 8 and B'_i = b_{8(i-1)+1} b_{8(i-1)+2} … b_{8i}, for i = 1, 2, …, m.

According to an embodiment of the method for calculating a file similarity hash of the present invention, if the length of the new byte stream is greater than 256, the new byte stream is regarded as a file and the method returns to the byte comparison step; otherwise, the new byte stream is the file similarity hash.

The method is simple to compute. The similarity hashes of similar files differ little and those of different files differ greatly, so different files can be effectively distinguished. The method has important application value in fast file retrieval and deduplication in file cloud storage systems.
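Because n bytes yield n bits, which pack into n ÷ 8 bytes (rounded up after zero padding), each round shortens the stream roughly eightfold. The following sketch only tracks stream lengths under the threshold of 256; the function name `round_lengths` and its parameters are illustrative assumptions, and lengths are assumed to be counted in bytes.

```python
def round_lengths(n: int, limit: int = 256) -> list[int]:
    """Length in bytes of the new byte stream after each round.

    Assumes one round maps n bytes to ceil(n / 8) bytes and that rounds
    repeat while the new stream is longer than `limit`.
    """
    lengths = []
    while True:
        n = (n + 7) // 8          # n bits, zero-padded to a multiple of 8
        lengths.append(n)
        if n <= limit:            # not greater than the limit: hash obtained
            return lengths

print(round_lengths(1_048_576))  # [131072, 16384, 2048, 256]
print(round_lengths(2049))       # [257, 33]
print(round_lengths(100))        # [13]
```

Under these assumptions, a file of more than 256 bytes ends with a hash of between 33 and 256 bytes, while a shorter file is hashed in a single round.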

Drawings

FIG. 1 is a flow chart of the method for calculating a file similarity hash according to the present invention.

Detailed Description

In order to make the objects, contents, and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.

The invention provides a method for calculating a file similarity hash, comprising the following steps:

(1) Byte comparison step. The file is regarded as a byte stream, each byte of the file is compared in size with its preceding and following bytes, and a bit of the file's intermediate hash is set according to the comparison result, yielding a bit stream.

(2) Similarity hash assembly step. The resulting bit stream is assembled into a new byte stream.

(3) Re-hashing step. If the length of the new byte stream is greater than a given value, the new byte stream is regarded as a file and the method returns to the byte comparison step; otherwise, the new byte stream is the file similarity hash.

(4) File similarity hash comparison step. The difference between the similarity hashes of two files is calculated to judge how similar the two files are.

In calculating the file similarity hash, the method converts the file byte stream into bits using the cyclic sequential difference or the common difference, which requires only simple integer addition and subtraction and is therefore simple and fast. In addition, the similarity hashes of similar files differ little and those of different files differ greatly, which can improve the user's experience of deduplication in a cloud storage system. The invention will therefore play an important role in file retrieval and file deduplication.

FIG. 1 is a flow chart of the method for calculating a file similarity hash according to the present invention. As shown in FIG. 1, the method includes:

In a specific implementation, if a file has n bytes from beginning to end, forming the byte stream B_1, B_2, …, B_n, it is converted into a bit stream b_1, b_2, …, b_n of n bits. The conversion process is as follows:

Cyclic sequential difference method: if B_i ≥ B_{i+1}, then b_i = 1, otherwise b_i = 0, for i = 1, 2, …, n. When i = n, i+1 is taken to be 1.

Common difference method: if B_i + B_i ≥ B_{i-1} + B_{i+1}, then b_i = 1, otherwise b_i = 0, for i = 1, 2, …, n. When i = 1, i-1 is taken to be n. When i = n, i+1 is taken to be 1.
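A minimal sketch of the two conversion rules follows. Indices are 0-based in the code, so the wrap-around described in the text becomes a modulo on the next index and Python's negative indexing on the previous one; the function names are illustrative assumptions.

```python
def cyclic_sequential_bits(data: bytes) -> list[int]:
    """b_i = 1 if B_i >= B_{i+1}, else 0; the last byte is compared with the first."""
    n = len(data)
    return [1 if data[i] >= data[(i + 1) % n] else 0 for i in range(n)]

def common_difference_bits(data: bytes) -> list[int]:
    """b_i = 1 if B_i + B_i >= B_{i-1} + B_{i+1}, else 0; neighbours wrap cyclically."""
    n = len(data)
    # data[i - 1] with i == 0 is data[-1], i.e. the last byte (cyclic wrap).
    return [1 if 2 * data[i] >= data[i - 1] + data[(i + 1) % n] else 0
            for i in range(n)]

sample = bytes([10, 20, 30, 5])
print(cyclic_sequential_bits(sample))  # [0, 0, 1, 0]
print(common_difference_bits(sample))  # [0, 1, 1, 0]
```

Both rules use only integer comparison and addition, and each produces exactly one bit per input byte.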

In a specific implementation, given the bit stream b_1, b_2, …, b_n, if the number of bits n is not a multiple of 8, bits of value 0 are appended after b_n until the number of bits n is a multiple of 8. Every 8 consecutive bits are assembled into one byte, so that the bit stream forms, in order, a new byte stream B'_1, B'_2, …, B'_m, where m = n ÷ 8 and B'_i = b_{8(i-1)+1} b_{8(i-1)+2} … b_{8i}, for i = 1, 2, …, m.
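The assembly step can be sketched as below. The text does not state the bit order within a byte; this sketch assumes the first bit of each group of 8 is the most significant bit. The function name `pack_bits` is an illustrative assumption.

```python
def pack_bits(bits: list[int]) -> bytes:
    """Zero-pad the bit stream to a multiple of 8 and pack each group of 8 bits.

    Assumption: the first bit of each group becomes the most significant bit.
    """
    padded = bits + [0] * (-len(bits) % 8)   # append 0 bits after the last bit
    out = bytearray()
    for start in range(0, len(padded), 8):
        value = 0
        for bit in padded[start:start + 8]:
            value = (value << 1) | bit       # shift in the next bit from the left
        out.append(value)
    return bytes(out)

print(pack_bits([0, 1, 1, 0]).hex())  # 60
print(pack_bits([1] * 9).hex())       # ff80
```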

In a specific implementation, if the length of the new byte stream is greater than 256, the new byte stream is regarded as a file and the method returns to the byte comparison step; otherwise, the new byte stream is the file similarity hash.
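Putting the byte comparison, assembly, and re-hashing steps together gives the following compact sketch. It is one possible reading, not a reference implementation: the names `one_round` and `similarity_hash`, the `common` flag that selects the common difference method, and the most-significant-bit-first packing are all assumptions.

```python
def one_round(data: bytes, common: bool = False) -> bytes:
    """One byte comparison pass followed by assembly into a new byte stream."""
    n = len(data)
    out = bytearray((n + 7) // 8)            # zero-filled, so padding bits are 0
    for i in range(n):
        nxt = data[(i + 1) % n]              # cyclic successor
        if common:
            bit = 2 * data[i] >= data[i - 1] + nxt   # common difference
        else:
            bit = data[i] >= nxt                     # cyclic sequential difference
        if bit:
            out[i // 8] |= 0x80 >> (i % 8)   # assumed MSB-first bit order
    return bytes(out)

def similarity_hash(data: bytes, limit: int = 256, common: bool = False) -> bytes:
    """Repeat rounds until the new byte stream is no longer than `limit` bytes."""
    while True:
        data = one_round(data, common)
        if len(data) <= limit:
            return data

print(one_round(bytes([10, 20, 30, 5])).hex())       # 20
print(len(similarity_hash(bytes(range(256)) * 16)))  # 64
```

In the second call, the 4096-byte input shrinks to 512 bytes, which still exceeds 256, and then to 64 bytes, which is returned as the hash.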

In a specific implementation, the difference between the similarity hash bit streams of two files can be calculated with the conventional edit distance method, where the cost of inserting 1 bit is 1 and the cost of deleting 1 bit is 1. The smaller the edit distance between the similarity hashes of two files, the more similar the two files are.
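A sketch of the comparison step follows. Only insertion and deletion costs are given in the text, so this sketch assumes no substitution operation: a mismatched bit costs one deletion plus one insertion. The names `to_bit_string` and `indel_distance` are illustrative assumptions.

```python
def to_bit_string(h: bytes) -> str:
    """Render a hash as a string of '0' and '1' characters, 8 per byte."""
    return "".join(f"{byte:08b}" for byte in h)

def indel_distance(a: str, b: str) -> int:
    """Edit distance with insertion cost 1 and deletion cost 1 (no substitution)."""
    prev = list(range(len(b) + 1))           # distance from "" to each prefix of b
    for i, x in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1])     # matching bits cost nothing
            else:
                curr.append(1 + min(prev[j], curr[j - 1]))  # delete or insert
        prev = curr
    return prev[-1]

h1, h2 = b"\x20", b"\x60"
print(indel_distance(to_bit_string(h1), to_bit_string(h2)))  # 2
```

The dynamic program is quadratic in the number of bits, which stays manageable when each hash is at most 256 bytes (2048 bits).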

The method is simple to compute. The similarity hashes of similar files differ little and those of different files differ greatly, so different files can be effectively distinguished. The method has important application value in fast file retrieval and deduplication in file cloud storage systems. It meets the need to judge the similarity of two files simply and quickly in the file deduplication function of a cloud storage system.

The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for calculating a file similarity hash, comprising:

regarding the file as a byte stream, comparing each byte of the file in size with its preceding and following bytes, and setting a bit of the file's intermediate hash according to the comparison result to obtain a bit stream;

assembling the resulting bit stream into a new byte stream;

if the length of the new byte stream is greater than a given value, regarding the new byte stream as a file and returning to the byte comparison step; otherwise, obtaining the file similarity hash;

and calculating the difference between the similarity hashes of two files to judge how similar the two files are.

2. The method for calculating a file similarity hash as claimed in claim 1, wherein if a file has n bytes from beginning to end, forming the byte stream B_1, B_2, …, B_n, it is converted into a bit stream b_1, b_2, …, b_n of n bits, the conversion process comprising:

a cyclic sequential difference method: if B_i ≥ B_{i+1}, then b_i = 1, otherwise b_i = 0, for i = 1, 2, …, n; when i = n, i+1 is taken to be 1;

a common difference method: if B_i + B_i ≥ B_{i-1} + B_{i+1}, then b_i = 1, otherwise b_i = 0, for i = 1, 2, …, n; when i = 1, i-1 is taken to be n, and when i = n, i+1 is taken to be 1.

3. The method of claim 2, wherein, given the resulting bit stream b_1, b_2, …, b_n, if the number of bits n is not a multiple of 8, bits of value 0 are appended after b_n until the number of bits n is a multiple of 8; every 8 consecutive bits are assembled into one byte, so that the bit stream forms, in order, a new byte stream B'_1, B'_2, …, B'_m, where m = n ÷ 8 and B'_i = b_{8(i-1)+1} b_{8(i-1)+2} … b_{8i}, for i = 1, 2, …, m.

4. The method for calculating a file similarity hash as claimed in claim 1, wherein if the length of the new byte stream is greater than 256, the new byte stream is regarded as a file and the method returns to the byte comparison step; otherwise, the file similarity hash is obtained.