CN114329465A - Storage method of massive MD5 feature codes

Info

Publication number: CN114329465A
Application number: CN202111507523.1A
Authority: CN
Inventors: 林凯
Original assignee: Beijing Falcon Safety Technology Co ltd
Current assignee: Beijing Falcon Safety Technology Co ltd
Priority date: 2021-12-10
Filing date: 2021-12-10
Publication date: 2022-04-12

Abstract

The storage method for massive MD5 feature codes provided by the invention comprises the following steps: reading the MD5 character strings in an MD5 data file, classifying them according to the contents of their first two bytes, and generating 256 block files; establishing a linked list array, with the file names of the block files as the data in the linked list array; processing each block file separately, converting each MD5 character string in the block file into integer data, and establishing an array for each MD5 character string to store the converted integer data; adding each array to the linked list array according to the file name of the block file in which it is located; sorting the arrays; and storing the sorted data. The method greatly improves the efficiency of storing and matching massive MD5 feature codes.

Description

Storage method of massive MD5 feature codes

Technical Field

The invention relates to the technical field of password management, in particular to a storage method of massive MD5 feature codes.

Background

When searching for and matching computer virus files, the MD5 of a file is inevitably used in matching calculations to find traces of virus files. Existing systems store and match file MD5 values using general-purpose string matching, which becomes very cumbersome when dealing with hundreds of millions of MD5 feature codes:

1) storing hundreds of millions of feature codes consumes a large amount of system memory, which severely limits the adoption and use of the technique;

2) string matching is computationally slow, which hinders fast matching against massive data.

Therefore, when the number of MD5 feature codes reaches hundreds of millions, efficient storage and matching are necessary for searching for and matching computer virus files.

Disclosure of Invention

The invention provides a storage method of massive MD5 feature codes, which is used for solving the technical problem of efficient storage and matching of massive MD5 feature codes.

The embodiment of the invention is as follows:

reading the MD5 character strings in an MD5 data file, classifying them according to the contents of their first two bytes, and generating 256 block files; establishing a linked list array, with the file names of the block files as the data in the linked list array; processing each block file separately, converting each MD5 character string in the block file into integer data, and establishing an integer array to store the converted integer data; adding each array to the linked list array according to the file name of the block file in which it is located; sorting the arrays; and storing the sorted data.

Further, before the MD5 strings are read, the MD5 data file is preprocessed and its contents are checked to ensure that each entry is a compliant MD5 string (characters 0-9 and a-f) that is 32 bytes long.

Further, the 256 block files are named after the contents of the first two bytes of the MD5 string, that is, the file names are 00, 01, 02, ..., A0, ..., B0, ..., up to FF, and the MD5 entries in each of the 256 block files are all MD5 data beginning with the prefix given by the corresponding file name.

Further, the 256 raw data files are processed separately: each MD5 character string (32 bytes) is converted into integer data consisting of two int64 values (16 bytes in total) and stored in an array in int64 format.

Furthermore, the sorting method is quicksort; other common sorting methods, such as bubble sort and merge sort, can also be adopted.

Further, in order to prevent the file from being damaged or maliciously tampered, when the sorted data is stored, signature and check information are added to the head of the file.

Further, when a data matching request is received, the position of the array is located according to the contents of the first two bytes of the MD5 feature code to be matched, and the MD5 feature code to be matched is looked up in that array by binary search.

According to this method for storing massive MD5 feature codes, the massive MD5 data file is partitioned by the contents of the first two bytes of each MD5 feature code, and each MD5 feature code is stored as two int64 integers, so that the efficiency of storing and matching massive MD5 feature codes is greatly improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

Fig. 1 is a flow chart of the storage method for massive MD5 feature codes;

Fig. 2 is a flow chart of MD5 feature code matching;

Fig. 3 is a schematic diagram of the in-memory storage format of the sorted massive MD5 feature code data.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

According to an embodiment of the present invention, a method for storing massive MD5 feature codes is provided. Fig. 1 is a flow chart of the storage method, which includes the following steps: reading the MD5 character strings in an MD5 data file, classifying them according to the contents of their first two bytes, and generating 256 block files; establishing a linked list array, with the file names of the block files as the data in the linked list array; processing each block file separately, converting each MD5 character string in the block file into integer data, and establishing an integer array to store the converted integer data; adding each array to the linked list array according to the file name of the block file in which it is located; and sorting the arrays. To avoid sorting the data again the next time the data files are loaded, the sorted partition data are stored in the original data format, and signature and check information is added to the file header to prevent the file from being damaged or maliciously tampered with.

The header data structure occupies the first 1024 bytes of the exported file and is verified during loading to prevent the file from being maliciously tampered with by a third party.
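As an illustration only, a fixed-size header of this kind could look like the following sketch. Only the 1024-byte header size comes from the text; the field layout, the `MAGIC` value, the use of SHA-256 as the check information, and the function names `build_file` and `load_file` are all assumptions made for the example.

```python
import hashlib
import struct

HEADER_SIZE = 1024        # from the text: the header occupies 1024 bytes
MAGIC = b"MD5STORE"       # hypothetical signature value
_FMT = "<8sQ32s"          # hypothetical layout: magic, entry count, SHA-256 of payload


def build_file(payload: bytes, count: int) -> bytes:
    """Prefix the sorted payload with a zero-padded 1024-byte header."""
    digest = hashlib.sha256(payload).digest()
    header = struct.pack(_FMT, MAGIC, count, digest)
    return header.ljust(HEADER_SIZE, b"\x00") + payload


def load_file(blob: bytes) -> bytes:
    """Verify the header and return the payload, or raise ValueError."""
    magic, count, digest = struct.unpack_from(_FMT, blob, 0)
    payload = blob[HEADER_SIZE:]
    if magic != MAGIC:
        raise ValueError("unexpected signature")
    if len(payload) != count * 16:  # 16 bytes per stored MD5
        raise ValueError("entry count does not match payload size")
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch: file damaged or modified")
    return payload
```

A plain hash such as this detects accidental corruption, but someone who can rewrite the header can also recompute it; resisting deliberate tampering would need a keyed check (for example an HMAC or a digital signature), which this sketch does not attempt.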

In a preferred embodiment of the present invention, the MD5 data file is preprocessed before the MD5 strings are read, and its contents are checked to ensure that each entry is a compliant MD5 string (characters 0-9 and a-f) that is 32 bytes long.
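A minimal sketch of such a compliance check is given below. It assumes one MD5 string per line and, as a further assumption not stated in the text, accepts uppercase hex digits and normalizes them to lowercase; the names `is_valid_md5` and `clean_md5_lines` are hypothetical.

```python
import re

# Exactly 32 hexadecimal characters. Accepting A-F is an assumption;
# the text only lists 0-9 and a-f.
_MD5_RE = re.compile(r"[0-9a-fA-F]{32}")


def is_valid_md5(line: str) -> bool:
    """Return True if the stripped line is a 32-character hex string."""
    return _MD5_RE.fullmatch(line.strip()) is not None


def clean_md5_lines(lines):
    """Yield normalized (lowercase) MD5 strings, skipping non-compliant lines."""
    for line in lines:
        if is_valid_md5(line):
            yield line.strip().lower()
```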

In a preferred embodiment of the present invention, the 256 block files are named after the contents of the first two bytes of the MD5 string, i.e., the file names are 00, 01, 02, ..., A0, ..., B0, ..., up to FF, and the MD5 entries in each of the 256 block files are all MD5 data beginning with the prefix given by the corresponding file name.
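The prefix-based partitioning can be sketched as follows. For brevity the example groups strings in memory instead of writing 256 files to disk, and the names `BLOCK_NAMES` and `partition_by_prefix` are hypothetical.

```python
from collections import defaultdict

# All 256 possible block names: "00", "01", ..., "FF".
BLOCK_NAMES = [f"{i:02X}" for i in range(256)]


def partition_by_prefix(md5_strings):
    """Group MD5 strings by their first two hex characters (upper-cased)."""
    blocks = defaultdict(list)
    for s in md5_strings:
        blocks[s[:2].upper()].append(s)
    return blocks
```

Because the first two hex characters can take 16 × 16 = 256 values, every valid MD5 string falls into exactly one block.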

In a preferred embodiment of the present invention, the 256 original data files are processed separately: each MD5 string (32 bytes) is converted into integer data consisting of two int64 values (16 bytes in total) and stored in an array in int64 format. For example, the string "6b5e4c956ccfab36b9e314e13cf35a5c" (32 bytes long) is converted into two integers in memory (stored in an int64 array):

a[0] = 0x6b5e4c956ccfab36;

a[1] = 0xb9e314e13cf35a5c.
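The conversion in this example can be reproduced with a short sketch. The function name `md5_to_int_pair` is hypothetical, and the halves are treated as unsigned 64-bit values, which is an assumption: in a language with signed int64 the bit patterns are identical, but the second value above would be negative and would compare differently.

```python
def md5_to_int_pair(md5_hex: str):
    """Split a 32-character hex MD5 into two 64-bit integers (high, low)."""
    if len(md5_hex) != 32:
        raise ValueError("expected 32 hex characters")
    # int(..., 16) raises ValueError on non-hex input.
    return int(md5_hex[:16], 16), int(md5_hex[16:], 16)


a = md5_to_int_pair("6b5e4c956ccfab36b9e314e13cf35a5c")
```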

The variable a (int64[2]) is added to the linked list array numbered 6b (the first two bytes), and the next record is then processed, until all files have been processed. Finally, the sorted partition MD5 data shown in Fig. 3 is formed in memory. For 200 million MD5 entries, the total memory occupied is:

200000000 × 16 / (1024 × 1024 × 1024) ≈ 2.98 GB

For a typical server configuration (8 GB of memory or more), this is sufficient.
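The estimate can be checked with a few lines of arithmetic. The sketch below uses Python's `array` module with type code `"Q"` (unsigned 64-bit) as a stand-in for the int64 array; that choice, and the variable names, are assumptions for illustration.

```python
from array import array

# Packed storage: two 64-bit integers per MD5, no per-object overhead.
pairs = array("Q")
pairs.extend((0x6b5e4c956ccfab36, 0xb9e314e13cf35a5c))

bytes_per_md5 = 2 * pairs.itemsize               # 2 x 8 = 16 bytes
total_gib = 200_000_000 * bytes_per_md5 / 1024**3
```

By comparison, keeping each entry as a 32-byte character string would need at least twice as much space before any per-string overhead is counted.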

In a preferred embodiment of the present invention, the sorting method is quicksort, but other common sorting methods, such as bubble sort or merge sort, may also be used.

When a data matching request is received, as shown in Fig. 2, the following steps are performed: the matching access request is received and its format is checked, and if the check fails, an error is returned; the position of the internal sorted array is located according to the first two bytes of the MD5 character string in the request content; a binary search is performed in the located array; and the matching result is returned. In tests on virus samples collected from the Internet, the average partition queue length for a store of 200 million MD5 entries is 78000. On an ordinary server, a query takes about 3.1 μs, which meets the system requirement.
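The matching flow can be sketched end to end as follows. This is a simplified illustration, not the patented implementation: a dictionary of lists stands in for the linked list array, Python's built-in sort is used in place of quicksort, and the names `build_index` and `contains` are hypothetical.

```python
from bisect import bisect_left

_HEX = set("0123456789abcdef")


def build_index(md5_strings):
    """Return {two-character prefix: sorted list of (high, low) integer pairs}."""
    index = {}
    for s in md5_strings:
        s = s.lower()
        pair = (int(s[:16], 16), int(s[16:], 16))
        index.setdefault(s[:2], []).append(pair)
    for pairs in index.values():
        pairs.sort()  # built-in sort, standing in for quicksort
    return index


def contains(index, md5_hex):
    """Format-check the request, locate its partition, then binary-search it."""
    s = md5_hex.strip().lower()
    if len(s) != 32 or not set(s) <= _HEX:
        raise ValueError("not a valid MD5 string")
    pairs = index.get(s[:2], [])
    target = (int(s[:16], 16), int(s[16:], 16))
    i = bisect_left(pairs, target)
    return i < len(pairs) and pairs[i] == target
```

Each lookup touches only one partition, and binary search over a sorted partition of n entries needs about log2(n) integer comparisons instead of a string comparison against every stored entry.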

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A storage method of massive MD5 feature codes is characterized by comprising the following steps:

reading an MD5 character string in an MD5 data file, classifying the MD5 character string according to the contents of the first two bytes, and generating 256 block files;

establishing a linked list array, and taking the file name of the block file as data in the linked list array;

processing the block files respectively, converting each MD5 character string in the block files into integer data, and establishing integer arrays for storing the converted integer data;

adding the array into the linked list array according to the file name of the block file where the array is located;

and sorting the arrays and storing the sorted data.

2. The method for storing massive MD5 feature codes, according to claim 1, wherein before reading the MD5 character strings, the MD5 data files are preprocessed, and the content of the data files is retrieved to ensure that each piece of data is a compliant MD5 character string.

3. The method for storing massive MD5 feature codes according to claim 1, wherein the 256 block files are named after the contents of the first two bytes of the MD5 character string.

4. The method for storing massive MD5 feature codes according to claim 1, wherein each MD5 character string in the block files is converted into two int64 values (16 bytes in total), and the two values are stored in the array in int64 format.

5. The method for storing massive MD5 feature codes according to claim 1, wherein the sorting method is quicksort.

6. The method for storing massive MD5 feature codes according to claim 1, wherein when the sorted data is stored, signature and verification information is added to a file header.

7. The method for storing massive MD5 feature codes according to claim 1, wherein when a data matching request is received, the position of the array is located according to the contents of the first two bytes of the MD5 feature code to be matched, and the MD5 feature code to be matched is looked up in the array by binary search.