CN110990897A - File fingerprint generation method and device

Info

Publication number: CN110990897A
Application number: CN201911291250.4A
Authority: CN
Inventors: 陈德勇; 尹家奇; 邵燕
Original assignee: Beijing Wuyou Chuangxiang Information Technology Co Ltd
Current assignee: Beijing Wuyou Chuangxiang Information Technology Co Ltd
Priority date: 2019-12-16
Filing date: 2019-12-16
Publication date: 2020-04-10

Abstract

The invention discloses a method and a device for generating a file fingerprint. The method comprises the following steps: step S1, acquiring the type characteristics of the file, and taking the obtained type characteristic value as a pre-fingerprint; step S2, grouping the file, obtaining sampled file data according to a sampling point position formula, and using a digest algorithm to obtain a hash value as the main body fingerprint; step S3, counting the size of the file and converting the file size into a hash value as a post-fingerprint; and step S4, splicing the pre-fingerprint, the main body fingerprint and the post-fingerprint to obtain a new file fingerprint.

Description

File fingerprint generation method and device

Technical Field

The invention relates to the technical field of file fingerprint algorithms, and in particular to a file fingerprint generation method and device suited to scenarios that require the fingerprint of a large file to be calculated quickly.

Background

With the rapid development of storage technology and cloud services, the volume of data, particularly cloud data, is growing at a multiplying rate. Mass data storage uses application software to make a large number of storage devices of different types in a network work together, by means of cluster applications, grid technology, distributed file systems and the like, so that they jointly provide data storage and service access to the outside. Therefore, when facing the large data volumes of heterogeneous systems, how to quickly compare and identify changes in the content of data and files, and respond accordingly, becomes a bottleneck for deploying large-scale services.

Existing file comparison methods generally adopt the file fingerprint algorithm MD5 (Message-Digest Algorithm 5) to ensure that information is transmitted completely and consistently. MD5 is one of the hash algorithms (also called digest algorithms) most widely used in computing, and mainstream programming languages generally provide an MD5 implementation. It maps data of any length to a fixed-length value, which is the basic principle of a hash algorithm, and works as follows: MD5 processes the input in 512-bit blocks, each block being divided into sixteen 32-bit sub-blocks; after a series of processing steps, the output of the algorithm consists of four 32-bit words, which are concatenated to produce a 128-bit hash value. The MD5 algorithm loops over the message, and the number of iterations equals the number of 512-bit blocks in the message.
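The block-wise behaviour described above can be observed with a short Python sketch. It uses the standard-library `hashlib` module; the helper name `md5_stream` and the sample data are illustrative assumptions and do not come from this disclosure.

```python
import hashlib


def md5_stream(chunks):
    """Feed an iterable of byte chunks to MD5 and return the 16-byte digest."""
    h = hashlib.md5()
    for chunk in chunks:
        h.update(chunk)  # MD5 consumes the data block by block internally
    return h.digest()


data = b"a" * 1000                       # illustrative input only
whole = hashlib.md5(data).digest()       # hash the data in one call
streamed = md5_stream([data[:300], data[300:]])  # hash the same data in two pieces

print(len(whole) * 8)                # digest length in bits
print(hashlib.md5().block_size * 8)  # internal block length in bits
print(whole == streamed)             # chunked and one-shot hashing agree
```

Feeding the data in pieces gives the same digest as hashing it at once, which is why the whole file has to be read when plain MD5 is used as the fingerprint.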

As can be seen from the above MD5 algorithm steps, as file capacity grows, the computation time and computing resources required by the MD5 algorithm increase geometrically. For smaller files (below 1 GB), the time and computing resources consumed in computing the file fingerprint still meet usage requirements. For large files (above 1 GB), however, the growth in file capacity causes both to increase geometrically, and in scenarios that require a file fingerprint to be computed quickly the MD5 algorithm clearly cannot meet the requirement. How to ensure that the computation time and computing resources needed for a file fingerprint rise steadily and within bounds as file capacity increases, rather than geometrically, is therefore a problem to be solved urgently, and it is necessary to improve the MD5 file fingerprint algorithm to suit scenarios that require the fingerprint of a large file to be computed quickly.

Disclosure of Invention

In order to overcome the defects of the prior art, the present invention provides a method and an apparatus for generating a file fingerprint, so as to increase the file fingerprint calculation speed and ensure the uniqueness of the file fingerprint.

In order to achieve the above object, the present invention provides a method for generating a file fingerprint, comprising the following steps:

Step S1, acquiring the type characteristics of the file, and taking the obtained type characteristic value as a pre-fingerprint;

Step S2, grouping the file, obtaining sampled file data according to a sampling point position formula, and using a digest algorithm to obtain a hash value as the main body fingerprint;

Step S3, counting the size of the file and converting the file size into a hash value as a post-fingerprint;

and Step S4, splicing the pre-fingerprint, the main body fingerprint and the post-fingerprint to obtain a new file fingerprint.

Preferably, in step S1, several bits of data from the file header of the file are extracted as the type feature of the file, the numerical range of the type feature value is expanded, and the resulting type feature value is taken as the pre-fingerprint.

Preferably, in step S1, the first 32 bits of data in the file header of the file are extracted as the type feature of the file, and the numerical range of the type feature value is expanded through hash calculation to obtain a 32-bit hash value as the pre-fingerprint.

Preferably, in step S1, the hash formula used in the hash calculation is:

s[0]*31^(n-1)+s[1]*31^(n-2)+s[2]*31^(n-3)+...+s[n-1]

wherein s is a 32-bit characteristic value array of the file type characteristic, n is the length of the characteristic value array, and n is 32.

Preferably, in step S2, the files are grouped according to 64K size, sample file data is obtained through the sample point position formula, and a digest algorithm is used to obtain a 128-bit hash value of the main body fingerprint as the main body fingerprint.

Preferably, the sampling point position formula is as follows:

where i is the ith sample.

Preferably, in step S3, the file capacity of the file is obtained and converted into a 32-bit binary representation as the post-fingerprint.

Preferably, in step S4, the 32-bit hash value obtained in step S1 is used as the pre-fingerprint, the 128-bit hash value obtained in step S2 is used as the main body fingerprint, and the 32-bit hash value obtained in step S3 is used as the post-fingerprint; the three are spliced into a new 192-bit hash value, which is the new file fingerprint.

Preferably, after step S4, the method further includes the following steps:

Step S5, comparing the file fingerprints generated for different files for equality: files whose file fingerprint hash values are identical are determined to be the same file, and files whose file fingerprint hash values differ are determined not to be the same file.

In order to achieve the above object, the present invention further provides a file fingerprint generating device, including:

a pre-fingerprint generating unit, configured to acquire the type characteristics of the file and take the obtained type characteristic value as a pre-fingerprint;

a main body fingerprint generating unit, configured to group the file, obtain sampled file data according to a sampling point position formula, and use a digest algorithm to obtain a hash value as the main body fingerprint;

a post-fingerprint generating unit, configured to count the size of the file and convert the file size into a hash value as a post-fingerprint;

and a splicing unit, configured to splice the pre-fingerprint, the main body fingerprint and the post-fingerprint to obtain a new file fingerprint.

Compared with the prior art, the method and device for generating a file fingerprint of the present invention acquire the type characteristics of the file and take the obtained type characteristic value as the pre-fingerprint; group the file, obtain sampled file data according to a sampling point position formula, and use a digest algorithm to obtain a hash value as the main body fingerprint; count the size of the file and convert it into a hash value as the post-fingerprint; and finally splice the pre-fingerprint, the main body fingerprint and the post-fingerprint to obtain a new file fingerprint, which speeds up fingerprint calculation for large files.

Drawings

FIG. 1 is a flowchart illustrating steps of a method for generating a file fingerprint according to the present invention;

FIG. 2 is a system architecture diagram of a file fingerprint generation apparatus according to the present invention;

FIG. 3 is a flowchart of a process for generating a file fingerprint according to an embodiment of the present invention.

Detailed Description

Other advantages and capabilities of the present invention will be readily apparent to those skilled in the art from the present disclosure by describing the embodiments of the present invention with specific embodiments thereof in conjunction with the accompanying drawings. The invention is capable of other and different embodiments and its several details are capable of modification in various other respects, all without departing from the spirit and scope of the present invention.

FIG. 1 is a flowchart illustrating steps of a method for generating a file fingerprint according to the present invention. As shown in fig. 1, the method for generating a file fingerprint of the present invention includes the following steps:

Step S1, extracting several bits of data from the file header of the file as the type characteristics of the file, expanding the numerical range of the type characteristic value, and taking the resulting type characteristic value as the pre-fingerprint. In the embodiment of the invention, the first 32 bits of data of the file header are extracted as the type characteristics of the file, and the numerical range of the type characteristic value is expanded through hash calculation to obtain a 32-bit hash value as the pre-fingerprint. Expanding the value range in this way reduces the possibility of characteristic collision.

In the embodiment of the present invention, the hash formula used in the hash calculation is:

s[0]*31^(n-1)+s[1]*31^(n-2)+s[2]*31^(n-3)+...+s[n-1]

The above hash formula is simplified as:

h(s) = sum over i = 0..n-1 of s[i]*31^(n-1-i)

where s is a 32-bit eigenvalue array of the file type characteristic, and n is the length of the eigenvalue array; in this embodiment, n is 32.
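A minimal Python sketch of the polynomial hash s[0]*31^(n-1)+...+s[n-1] follows. It rests on several assumptions that the text does not settle: each of the 32 header bits is treated as one element of s, bits are read most significant first, headers shorter than 4 bytes are zero-padded, and the result is reduced modulo 2^32 so that it fits a 32-bit pre-fingerprint. The function name `pre_fingerprint` is hypothetical.

```python
def pre_fingerprint(header: bytes) -> int:
    """Polynomial hash with multiplier 31 over the first 32 bits of the header."""
    first4 = header[:4].ljust(4, b"\x00")  # assumption: zero-pad very short files
    # s[0..31]: the 32 header bits, most significant bit of the first byte first
    # (assumption about bit order)
    s = [(byte >> (7 - k)) & 1 for byte in first4 for k in range(8)]
    h = 0
    for bit in s:
        # Horner's rule: equivalent to s[0]*31^(n-1) + ... + s[n-1],
        # kept to 32 bits at every step (assumption)
        h = (31 * h + bit) & 0xFFFFFFFF
    return h


# Files of the same type start with the same magic bytes,
# so they receive the same pre-fingerprint.
print(pre_fingerprint(b"\x89PNG\r\n\x1a\n") == pre_fingerprint(b"\x89PNG other data"))
```

Horner's rule is used only to avoid computing large powers of 31; it evaluates the same sum as the expanded formula.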

The reason for selecting 31 as the hash multiplier in the present invention is as follows:

(1) A prime multiplier can effectively reduce the collision rate of the hash algorithm during hash calculation.

(2) An odd multiplier retains more information than an even one during hash calculation. Multiplying by 2 is equivalent to a left shift, so if an even number is selected, bits are shifted out when the multiplication overflows and numerical information is lost.

(3) Besides being both prime and odd, 31 has a further useful property: multiplication by 31 can be reduced to a shift and a subtraction, which gives better performance. For any positive integer m:

31*m = (32-1)*m = 32*m - m = (m << 5) - m
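The shift-and-subtract identity can be checked directly. The short Python sketch below is illustrative only; the function name `times31` is hypothetical.

```python
def times31(m: int) -> int:
    """Multiply by 31 using one left shift and one subtraction."""
    return (m << 5) - m  # m << 5 is 32*m, so (m << 5) - m is 31*m


# Spot-check the identity against ordinary multiplication.
for m in (0, 1, 7, 12345, 2**31 - 1):
    assert times31(m) == 31 * m
print(times31(7))
```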

Step S2, grouping the file, obtaining the sampled file data through a sampling formula, and using the digest algorithm MD5 to obtain a hash value as the main body fingerprint. In the embodiment of the invention, the file is grouped into blocks of 64K, the sampled file data is obtained through the sampling formula, and the digest algorithm MD5 is used to obtain a 128-bit hash value as the main body fingerprint.

Specifically, in step S2, the file is grouped into blocks of 64K, the sampling positions are calculated by the following sampling formula, the sampled data is spliced back together into a single message, and the MD5 algorithm is applied to the spliced data to give the main body fingerprint, which is 128 bits long by the nature of MD5.

In the embodiment of the present invention, the sampling point position is calculated by the following sampling formula:

where i is the ith sample.

The sampling positions calculated by the sampling formula define the sampling rule for the file data, so that a large-capacity file is checked incrementally. The more samples are taken from a large-capacity file, the larger the subsequent sampling interval becomes, which balances efficiency against accuracy.
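The sampling step can be sketched in Python as follows. The sampling point position formula itself is not reproduced in this text, so `example_position` below is a purely hypothetical stand-in whose only shared property is that the gap between samples grows with the sample index; it is not the formula of the invention. The names `BLOCK` and `body_fingerprint` are likewise illustrative assumptions, and 64K is read as 64 * 1024 bytes.

```python
import hashlib
import io

BLOCK = 64 * 1024  # 64K grouping size from the text (assumed to mean bytes)


def example_position(i: int) -> int:
    """HYPOTHETICAL sampling rule, not the patent's formula.

    Returns the block index of the i-th sample: 0, 1, 3, 6, 10, ...
    so that the interval between samples grows with i.
    """
    return i * (i + 1) // 2


def body_fingerprint(f, size: int, position=example_position) -> bytes:
    """MD5 over the sampled 64K blocks, spliced in sampling order."""
    h = hashlib.md5()
    n_blocks = (size + BLOCK - 1) // BLOCK
    i = 0
    while True:
        idx = position(i)      # position must be strictly increasing in i
        if idx >= n_blocks:
            break
        f.seek(idx * BLOCK)
        h.update(f.read(BLOCK))  # successive updates hash the concatenation
        i += 1
    return h.digest()            # 16 bytes = 128 bits


# Four blocks of distinct content; the placeholder rule samples blocks 0, 1 and 3.
data = b"".join(bytes([k]) * BLOCK for k in range(4))
fp = body_fingerprint(io.BytesIO(data), len(data))
print(len(fp) * 8)
```

Only the sampled blocks are read, so the amount of data hashed grows more slowly than the file size under any rule whose intervals widen.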

Step S3, counting the size of the file and converting it into a hash value with the same number of bits as the pre-fingerprint, to serve as the post-fingerprint. Specifically, the file size is obtained and converted into a 32-bit binary representation as the post-fingerprint.
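One way to obtain a 32-bit representation of the file size is shown in the Python sketch below. The text does not say how sizes that need more than 32 bits (4 GiB and above) are handled, nor which byte order is used; the sketch assumes wrap-around modulo 2^32 and big-endian order, and the function name `post_fingerprint` is hypothetical.

```python
import struct


def post_fingerprint(size: int) -> bytes:
    """Encode the file size as 4 bytes (32 bits)."""
    # Assumptions: big-endian byte order; sizes of 4 GiB or more wrap modulo 2**32.
    return struct.pack(">I", size & 0xFFFFFFFF)


print(post_fingerprint(1 << 30).hex())  # size of a 1 GiB file
```

In practice the size would come from a call such as `os.path.getsize(path)`.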

Step S4, splicing the pre-fingerprint, the main body fingerprint and the post-fingerprint to obtain a new file fingerprint. In the embodiment of the present invention, the 32-bit hash value obtained in step S1 is used as the pre-fingerprint, the 128-bit hash value obtained in step S2 is used as the main body fingerprint, and the 32-bit hash value obtained in step S3 is used as the post-fingerprint; the three are spliced into a new 192-bit hash value, which is the new file fingerprint.
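The splice of 32 + 128 + 32 bits can be sketched in Python as a simple byte concatenation. The function name `splice`, the big-endian encoding of the two 32-bit parts, and the sample values are assumptions made for illustration.

```python
import struct


def splice(pre: int, body: bytes, post: int) -> bytes:
    """Concatenate pre (32 bits), body (128 bits) and post (32 bits)."""
    if len(body) != 16:
        raise ValueError("main body fingerprint must be 16 bytes (128 bits)")
    # Assumption: both 32-bit parts are stored big-endian.
    return (struct.pack(">I", pre & 0xFFFFFFFF)
            + body
            + struct.pack(">I", post & 0xFFFFFFFF))


# Illustrative values only: an arbitrary pre value, an all-zero body, a 1024-byte size.
fp = splice(0x89504E47, bytes(16), 1024)
print(len(fp) * 8)  # total fingerprint length in bits
print(fp.hex())
```

Because every part has a fixed width, the three components can be recovered from the fingerprint by slicing at byte offsets 4 and 20.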

Preferably, after step S4, the method for generating a file fingerprint according to the present invention further includes the following step:

Step S5, comparing the file fingerprints (192-bit hash values in the embodiment of the present invention) generated for different files for equality: files whose hash values are identical are determined to be the same file, and files whose hash values differ are determined not to be the same file.
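The equality comparison of step S5 can be applied to many files at once by grouping them on their fingerprint. The Python sketch below is one possible arrangement; the function name `group_by_fingerprint`, the file names and the fingerprint values are hypothetical.

```python
from collections import defaultdict


def group_by_fingerprint(fingerprints):
    """Return lists of names whose fingerprints are identical.

    `fingerprints` maps a file name to its 24-byte (192-bit) fingerprint.
    """
    groups = defaultdict(list)
    for name, fp in fingerprints.items():
        groups[fp].append(name)
    # Keep only fingerprints shared by two or more files.
    return [names for names in groups.values() if len(names) > 1]


# Illustrative fingerprints only.
sample = {"a.bin": b"\x01" * 24, "b.bin": b"\x01" * 24, "c.bin": b"\x02" * 24}
print(group_by_fingerprint(sample))
```

Since the main body fingerprint is computed from sampled data only, equal fingerprints identify files as the same under the rule of step S5; bytes outside the sampled blocks are not examined.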

Fig. 2 is a system architecture diagram of a file fingerprint generation apparatus according to the present invention. As shown in fig. 2, the present invention provides a file fingerprint generation apparatus, including:

The pre-fingerprint generating unit 201 is configured to extract several bits of data from the file header of the file as the type feature of the file, expand the numerical range of the type feature value, and use the resulting type feature value as the pre-fingerprint. In the embodiment of the present invention, the pre-fingerprint generating unit 201 extracts the first 32 bits of data of the file header as the type feature of the file and expands the numerical range of the type feature value by hash calculation to obtain a 32-bit hash value as the pre-fingerprint, thereby reducing the possibility of feature collision over a larger value range.

In the embodiment of the present invention, the hash formula used in the hash calculation is:

s[0]*31^(n-1)+s[1]*31^(n-2)+s[2]*31^(n-3)+...+s[n-1]

The above hash formula is simplified as:

h(s) = sum over i = 0..n-1 of s[i]*31^(n-1-i)

where s is the 32-bit feature value array of the file type feature and n is the length of the array; in this embodiment, n is 32. Multiplication by 31 can be reduced to a shift and a subtraction:

31*m = (32-1)*m = 32*m - m = (m << 5) - m

The main body fingerprint generating unit 202 is configured to group the file, obtain sampled file data through a sampling formula, and use the digest algorithm MD5 to obtain a hash value as the main body fingerprint. In an embodiment of the present invention, the main body fingerprint generating unit 202 groups the file into blocks of 64K, obtains sampled file data through the sampling formula, and uses the digest algorithm MD5 to obtain a 128-bit hash value as the main body fingerprint.

Specifically, the main body fingerprint generating unit 202 groups the file into blocks of 64K, calculates the sampling positions by the following sampling formula, splices the sampled data back together into a single message, and applies the MD5 algorithm to the spliced data to give the main body fingerprint, which is 128 bits long by the nature of MD5.

where i is the ith sample.

The post-fingerprint generating unit 203 is configured to count the size of the file and convert it into a hash value with the same number of bits as the pre-fingerprint, to serve as the post-fingerprint. Specifically, the post-fingerprint generating unit 203 obtains the file capacity and converts it into a 32-bit binary representation as the post-fingerprint.

The splicing unit 204 is configured to splice the pre-fingerprint, the main body fingerprint and the post-fingerprint to obtain a new file fingerprint. In the embodiment of the present invention, the 32-bit hash value obtained by the pre-fingerprint generating unit 201 is used as the pre-fingerprint, the 128-bit hash value obtained by the main body fingerprint generating unit 202 is used as the main body fingerprint, and the 32-bit hash value obtained by the post-fingerprint generating unit 203 is used as the post-fingerprint; the three are spliced into a new 192-bit hash value, which is the new file fingerprint.

Preferably, the file fingerprint generating device of the present invention further includes:

a comparison unit, configured to compare the file fingerprints (192-bit hash values in the embodiment of the invention) generated for different files for equality, so that files whose hash values are identical are determined to be the same file and files whose hash values differ are determined to be different files.

Examples

As shown in fig. 3, in this embodiment the present invention is used to generate the file fingerprint of a large file. The file fingerprint generation process for the large file to be calculated is as follows:

Step 1, extracting the first 32 bits of data of the file header of the large file to be calculated as the type characteristic of the file, expanding the numerical range of the type characteristic value through hash calculation so as to reduce the possibility of characteristic collision, and taking the type characteristic value obtained through the hash calculation as the pre-fingerprint.

Step 2, grouping the file into blocks of 64K, calculating the sampling positions through the following sampling point position formula, splicing the sampled data back together into a single message, and applying the MD5 algorithm to the spliced data to give the main body fingerprint, which is 128 bits long by the nature of MD5.

The sampling point position formula is as follows:

where i is the ith sample.

Step 3, obtaining the file capacity and converting it into a 32-bit binary representation to serve as the post-fingerprint.

Step 4, splicing the obtained 32-bit pre-fingerprint, 128-bit main body fingerprint and 32-bit post-fingerprint to obtain a new 192-bit file fingerprint.

A 1 GB file was used as test material to compare the computation time of the MD5 algorithm with that of the present invention. The comparison result is as follows:

It can be seen that, with the present invention, the overall process takes about 1/10 of the time of the MD5 digest algorithm, that is, an order of magnitude less computation time.

In summary, the method and device for generating a file fingerprint of the invention take the type characteristic value of the file as the pre-fingerprint; group the file, obtain sampled file data according to the sampling point position formula, and use a digest algorithm to obtain a hash value as the main body fingerprint; count the size of the file and convert it into a hash value as the post-fingerprint; and finally splice the pre-fingerprint, the main body fingerprint and the post-fingerprint to obtain a new file fingerprint.

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Modifications and variations can be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined from the following claims.

Claims

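1. A file fingerprint generation method, comprising the following steps:

step S1, acquiring the type characteristics of the file, and taking the obtained type characteristic value as a pre-fingerprint;

step S2, grouping the file, obtaining sampled file data according to a sampling point position formula, and using a digest algorithm to obtain a hash value as the main body fingerprint;

step S3, counting the size of the file and converting the file size into a hash value as a post-fingerprint;

and step S4, splicing the pre-fingerprint, the main body fingerprint and the post-fingerprint to obtain a new file fingerprint.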

2. A method of generating a file fingerprint according to claim 1, wherein: in step S1, a plurality of bits of data of the file header of the file are extracted as the type feature of the file, the range of the type feature value is expanded, and the obtained type feature value is used as the pre-fingerprint.

3. A method of generating a file fingerprint according to claim 2, wherein: in step S1, the first 32 bits of data of the file header of the file are extracted as the type feature of the file, and the numerical range of the type feature value is expanded by hash calculation to obtain a 32-bit hash value as the pre-fingerprint.

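4. A method for generating a file fingerprint according to claim 3, wherein in step S1, the hash formula used in the hash calculation is:

s[0]*31^(n-1)+s[1]*31^(n-2)+s[2]*31^(n-3)+...+s[n-1]

wherein s is a 32-bit characteristic value array of the file type characteristic, n is the length of the characteristic value array, and n is 32.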

5. A method of generating a file fingerprint according to claim 2, wherein: in step S2, the files are grouped according to 64K size, sample file data is obtained through the sample point position formula, and a digest algorithm is used to obtain a 128-bit hash value of the main body fingerprint as the main body fingerprint.

6. The method for generating a file fingerprint according to claim 5, wherein the sampling point position formula is as follows:

where i is the ith sample.

7. A method for generating a file fingerprint according to claim 5, wherein: in step S3, the file capacity of the file is obtained and converted into 32-bit binary representation as the post-fingerprint.

8. A method for generating a file fingerprint according to claim 7, wherein: in step S4, the 32-bit hash value obtained in step S1 is used as the pre-fingerprint, the 128-bit hash value obtained in step S2 is used as the main body fingerprint, and the 32-bit hash value obtained in step S3 is used as the post-fingerprint, and the three are spliced into a new 192-bit hash value, which is the new file fingerprint.

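9. The method for generating a file fingerprint according to claim 8, further comprising the following step after step S4:

step S5, comparing the file fingerprints generated for different files for equality, wherein files whose file fingerprint hash values are identical are determined to be the same file, and files whose file fingerprint hash values differ are determined not to be the same file.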

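10. An apparatus for generating a file fingerprint, comprising:

a pre-fingerprint generating unit, configured to acquire the type characteristics of the file and take the obtained type characteristic value as a pre-fingerprint;

a main body fingerprint generating unit, configured to group the file, obtain sampled file data according to a sampling point position formula, and use a digest algorithm to obtain a hash value as the main body fingerprint;

a post-fingerprint generating unit, configured to count the size of the file and convert the file size into a hash value as a post-fingerprint;

and a splicing unit, configured to splice the pre-fingerprint, the main body fingerprint and the post-fingerprint to obtain a new file fingerprint.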