CN114329465A - Storage method of massive MD5 feature codes - Google Patents

Info

Publication number
CN114329465A
Authority
CN
China
Prior art keywords
data
array
file
storing
massive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111507523.1A
Other languages
Chinese (zh)
Inventor
林凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Falcon Safety Technology Co ltd
Original Assignee
Beijing Falcon Safety Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Falcon Safety Technology Co ltd filed Critical Beijing Falcon Safety Technology Co ltd
Priority to CN202111507523.1A priority Critical patent/CN114329465A/en
Publication of CN114329465A publication Critical patent/CN114329465A/en
Pending legal-status Critical Current

Abstract

The invention provides a storage method for massive numbers of MD5 feature codes, comprising the following steps: reading the MD5 character strings in an MD5 data file, classifying them according to the contents of their first two bytes, and generating 256 block files; establishing a linked list array, with the file names of the block files as the data in the linked list array; processing each block file separately, converting each MD5 character string in it into integer data, and establishing an array for each MD5 character string to store the converted integer data; adding each array to the linked list array according to the name of the block file it came from; sorting the arrays; and storing the sorted data. The method greatly improves the efficiency of storing and matching massive numbers of MD5 feature codes.

Description

Storage method of massive MD5 feature codes
Technical Field
The invention relates to the technical field of password management, in particular to a storage method of massive MD5 feature codes.
Background
When searching for and matching computer virus files, the MD5 feature of a file is inevitably used in the matching computation to find traces of virus files. Existing systems store and match file MD5 values using general-purpose string matching, which becomes very cumbersome when handling hundreds of millions of MD5 feature codes:
1) storing hundreds of millions of feature codes consumes a large amount of system memory, which severely limits how widely the technique can be deployed;
2) string matching is computationally slow, which hinders fast matching against massive data.
Therefore, when the number of MD5 feature codes reaches hundreds of millions, efficient storage and matching are essential for searching for and matching computer virus files.
Disclosure of Invention
The invention provides a storage method of massive MD5 feature codes, which is used for solving the technical problem of efficient storage and matching of massive MD5 feature codes.
The embodiment of the invention is as follows:
reading the MD5 character strings in an MD5 data file, classifying them according to the contents of their first two bytes, and generating 256 block files; establishing a linked list array, with the file names of the block files as the data in the linked list array; processing each block file separately, converting each MD5 character string in it into integer data, and establishing an integer array to store the converted integer data; adding each array to the linked list array according to the name of the block file it came from; sorting the arrays; and storing the sorted data.
Further, before the MD5 character strings are read, the MD5 data file is preprocessed and its contents are checked to ensure that each record is a compliant MD5 character string: 32 bytes long and consisting only of the characters 0-9 and a-f.
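The preprocessing check can be sketched as follows. This is a minimal illustration, assuming one MD5 character string per input line; the names `is_valid_md5` and `clean_md5_lines`, and the decision to accept uppercase hex digits and normalize them to lowercase, are assumptions and not details taken from the patent.

```python
import re

# Exactly 32 hexadecimal characters. Accepting uppercase is an assumption.
_MD5_RE = re.compile(r"[0-9a-fA-F]{32}")

def is_valid_md5(line: str) -> bool:
    """Return True if the stripped line is a well-formed MD5 hex string."""
    return _MD5_RE.fullmatch(line.strip()) is not None

def clean_md5_lines(lines):
    """Yield normalized (lowercase) MD5 strings, skipping malformed lines."""
    for line in lines:
        if is_valid_md5(line):
            yield line.strip().lower()
```

Malformed records are silently dropped here; a real preprocessing step might instead log or count them.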
Further, the 256 block files are named after the contents of the first two bytes of the MD5 character strings, that is, 00, 01, 02, ..., A0, ..., B0, ..., up to FF; every MD5 entry in each of the 256 block files begins with the prefix given by the file name.
Further, the 256 block files are processed separately: each MD5 character string (32 bytes) is converted into two int64 integers (16 bytes in total) and stored in an array in int64 format.
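The conversion from a 32-character string to two 64-bit integers can be sketched as below, using the example string given later in the description. The function names are hypothetical. One assumption is worth flagging: the patent says int64, but a value such as 0xb9e314e13cf35a5c exceeds the signed 64-bit maximum, so this sketch packs the halves as unsigned 64-bit integers.

```python
import struct

def md5_to_int_pair(md5_hex: str) -> tuple[int, int]:
    """Split a 32-character MD5 hex string into two 64-bit integers."""
    high = int(md5_hex[:16], 16)   # first 16 hex characters
    low = int(md5_hex[16:], 16)    # last 16 hex characters
    return high, low

def pack_pair(pair: tuple[int, int]) -> bytes:
    """Pack the pair into 16 bytes (big-endian, unsigned 64-bit; an assumption)."""
    return struct.pack(">QQ", *pair)

pair = md5_to_int_pair("6b5e4c956ccfab36b9e314e13cf35a5c")
print(hex(pair[0]), hex(pair[1]))  # 0x6b5e4c956ccfab36 0xb9e314e13cf35a5c
```

Each record shrinks from 32 bytes of text to 16 bytes of binary data, and comparing two records becomes at most two integer comparisons instead of a string comparison.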
Further, the sorting method is quicksort; other common sorting methods, such as bubble sort or merge sort, can also be used.
Further, to guard against file corruption or malicious tampering, signature and check information is added to the file header when the sorted data is stored.
Further, when a data matching request is received, the array is located according to the contents of the first two bytes of the MD5 feature code to be matched, and the feature code is then looked up in that array by binary search.
With this storage method, the massive MD5 data file is partitioned by the first two bytes of each MD5 feature code, and each feature code is stored as two int64 integers, which greatly improves the efficiency of storing and matching massive numbers of MD5 feature codes.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flow chart of the storage method of massive MD5 feature codes;
Fig. 2 is a flow chart of MD5 feature code matching;
Fig. 3 is a schematic diagram of the in-memory storage format of the sorted massive MD5 feature code data.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
According to an embodiment of the present invention, a method for storing massive MD5 feature codes is provided. Fig. 1 is a flow chart of the storage method, which includes the following steps: reading the MD5 character strings in an MD5 data file, classifying them according to the contents of their first two bytes, and generating 256 block files; establishing a linked list array, with the file names of the block files as the data in the linked list array; processing each block file separately, converting each MD5 character string in it into integer data, and establishing an integer array to store the converted integer data; adding each array to the linked list array according to the name of the block file it came from; sorting the arrays; and storing the sorted data.
To avoid re-sorting the data when the data files are next loaded, the sorted partition data is stored in the original data format, and signature and check information is added to the file header. To guard against file corruption or malicious tampering, the file header carries the following signature and check information:
[Figures BDA0003403773410000031 and BDA0003403773410000041: file header data structure (images not reproduced)]
This data structure occupies the first 1024 bytes of the exported file, and the corresponding verification is performed at load time to detect malicious tampering by a third party.
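Because the header definition survives only as image references, its actual fields are unknown. The following is a purely hypothetical sketch of a 1024-byte header carrying a signature and check information: the magic value `MD5STORE`, the field order, and the use of SHA-256 are all assumptions made for illustration, not the patent's layout.

```python
import hashlib
import struct

HEADER_SIZE = 1024                   # header length stated in the description
MAGIC = b"MD5STORE"                  # hypothetical 8-byte signature
# Hypothetical fixed fields: magic, version, record count, SHA-256 of payload.
_FIXED = struct.Struct(">8sIQ32s")

def build_file(payload: bytes, version: int = 1) -> bytes:
    """Prefix the sorted 16-byte records with a zero-padded 1024-byte header."""
    count = len(payload) // 16
    digest = hashlib.sha256(payload).digest()
    fixed = _FIXED.pack(MAGIC, version, count, digest)
    return fixed.ljust(HEADER_SIZE, b"\x00") + payload

def load_file(blob: bytes) -> bytes:
    """Verify the header and return the payload, or raise ValueError."""
    magic, _version, count, digest = _FIXED.unpack_from(blob, 0)
    payload = blob[HEADER_SIZE:]
    if magic != MAGIC:
        raise ValueError("bad signature")
    if count * 16 != len(payload):
        raise ValueError("record count mismatch")
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch")
    return payload
```

A plain hash like this detects accidental corruption, but anyone able to rewrite the file can also recompute it; resisting deliberate tampering would require a keyed MAC or a digital signature.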
In a preferred embodiment of the present invention, the MD5 data file is preprocessed before the MD5 character strings are read, and its contents are checked to ensure that each record is a compliant MD5 character string: 32 bytes long and consisting only of the characters 0-9 and a-f.
In a preferred embodiment of the present invention, the 256 block files are named after the contents of the first two bytes of the MD5 character strings, that is, 00, 01, 02, ..., A0, ..., B0, ..., up to FF; every MD5 entry in each of the 256 block files begins with the prefix given by the file name.
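The partitioning step can be sketched as follows. This is an illustrative sketch only: the function name `split_into_blocks`, the one-record-per-line file format, and the lowercase file names (00 through ff) are assumptions. All 256 files are kept open at once so the input can be streamed rather than held in memory.

```python
import os
from contextlib import ExitStack

def split_into_blocks(md5_strings, out_dir):
    """Write each MD5 string to the block file named after its first two characters."""
    os.makedirs(out_dir, exist_ok=True)
    with ExitStack() as stack:
        # Open all 256 block files up front, so empty blocks still exist.
        handles = {
            f"{value:02x}": stack.enter_context(
                open(os.path.join(out_dir, f"{value:02x}"), "w", encoding="ascii")
            )
            for value in range(256)
        }
        for md5 in md5_strings:
            md5 = md5.lower()
            handles[md5[:2]].write(md5 + "\n")
```

The input is assumed to have passed the compliance check already; an invalid prefix would raise a `KeyError` here.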
In a preferred embodiment of the present invention, the 256 block files are processed separately: each MD5 character string (32 bytes) is converted into two int64 integers (16 bytes in total) and stored in an array in int64 format. For example, the character string "6b5e4c956ccfab36b9e314e13cf35a5c" (32 bytes long) is converted into two integers in memory (stored in an int64 array):
a[0]=0x6b5e4c956ccfab36;
a[1]=0xb9e314e13cf35a5c.
The variable a (int64[2]) is added to the linked list array numbered 6b (the first two bytes), and the next record is processed, until all files have been processed. The result in memory is the sorted, partitioned MD5 data shown in Fig. 3. For 200 million MD5 entries, the total memory footprint is:
200000000/(1024*1024*1024)*16 = 2.98 GB
which is well within the capacity of a typical server configuration (8 GB of memory or more).
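The memory estimate can be reproduced with a few lines of arithmetic. The sketch below counts only the raw record data (16 bytes per feature code) and ignores any container or runtime overhead; the comparison with storing the original 32-byte strings is added for illustration.

```python
RECORDS = 200_000_000      # number of MD5 feature codes, from the description
BYTES_PER_RECORD = 16      # two 64-bit integers per feature code

total_bytes = RECORDS * BYTES_PER_RECORD
as_strings = RECORDS * 32  # the same records kept as 32-byte strings

# Divide by 2**30 (1024*1024*1024), as in the formula above.
print(f"{total_bytes / 2**30:.2f}")  # prints 2.98
print(f"{as_strings / 2**30:.2f}")   # prints 5.96
```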
In a preferred embodiment of the present invention, the sorting method is quicksort, but other common sorting methods, such as bubble sort or merge sort, may also be used.
When a data matching request is received, it is handled as shown in Fig. 2: the matching access request is received and its format is checked, and an error is returned if the check fails; the position of the internal sorted array is located according to the first byte of the MD5 value in the request (the first two bytes of its character string); a binary search is performed in the located array; and the matching result is returned. In tests on virus samples collected from the Internet, with 200 million MD5 entries stored, the average partition queue length was 78000. On an ordinary server a query took about 3.1 µs, which meets the system requirement.
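The matching flow (format check, partition lookup, binary search, result) can be sketched as below. This is an in-memory illustration under assumptions: the names `build_index` and `match` are hypothetical, a plain dictionary of sorted lists stands in for the linked list array, and Python's built-in sort and `bisect` module stand in for the quicksort and binary search named in the text.

```python
import re
from bisect import bisect_left

_MD5_RE = re.compile(r"[0-9a-fA-F]{32}")

def build_index(md5_strings):
    """Build 256 sorted lists of (high, low) integer pairs, keyed by prefix."""
    index = {f"{value:02x}": [] for value in range(256)}
    for md5 in md5_strings:
        md5 = md5.lower()
        index[md5[:2]].append((int(md5[:16], 16), int(md5[16:], 16)))
    for bucket in index.values():
        bucket.sort()
    return index

def match(index, query):
    """Return True if the query is stored; raise ValueError if it is malformed."""
    if _MD5_RE.fullmatch(query) is None:
        raise ValueError("malformed MD5 string")
    query = query.lower()
    bucket = index[query[:2]]                      # locate the partition
    key = (int(query[:16], 16), int(query[16:], 16))
    pos = bisect_left(bucket, key)                 # binary search in the partition
    return pos < len(bucket) and bucket[pos] == key
```

A binary search over a partition of 78000 entries needs at most 17 comparisons, since 2^17 = 131072 exceeds 78000.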
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. A storage method of massive MD5 feature codes is characterized by comprising the following steps:
reading an MD5 character string in an MD5 data file, classifying the MD5 character string according to the contents of the first two bytes, and generating 256 block files;
establishing a linked list array, and taking the file name of the block file as data in the linked list array;
processing the block files respectively, converting each MD5 character string in the block files into integer data, and establishing integer arrays for storing the converted integer data;
adding the array into the linked list array according to the file name of the partitioned file where the array is located;
and sorting the arrays and storing the sorted data.
2. The method for storing massive MD5 feature codes, according to claim 1, wherein before reading the MD5 character strings, the MD5 data files are preprocessed, and the content of the data files is retrieved to ensure that each piece of data is a compliant MD5 character string.
3. The method for storing massive MD5 feature codes according to claim 1, wherein the 256 block files are named after the contents of the first two bytes of the MD5 character string.
4. The method for storing massive MD5 feature codes according to claim 1, wherein each MD5 character string in the block files is converted into two int64 integers (16 bytes in total), which are stored in the array in int64 format.
5. The method for storing massive MD5 feature codes according to claim 1, wherein the sorting method is quicksort.
6. The method for storing massive MD5 feature codes according to claim 1, wherein when the sorted data is stored, signature and check information is added to the file header.
7. The method for storing massive MD5 feature codes according to claim 1, wherein when a data matching request is received, the position of the array is located according to the contents of the first two bytes of the MD5 feature code to be matched, and the MD5 feature code to be matched is looked up in the array by binary search.
CN202111507523.1A 2021-12-10 2021-12-10 Storage method of massive MD5 feature codes Pending CN114329465A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111507523.1A CN114329465A (en) 2021-12-10 2021-12-10 Storage method of massive MD5 feature codes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111507523.1A CN114329465A (en) 2021-12-10 2021-12-10 Storage method of massive MD5 feature codes

Publications (1)

Publication Number Publication Date
CN114329465A true CN114329465A (en) 2022-04-12

Family

ID=81051259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111507523.1A Pending CN114329465A (en) 2021-12-10 2021-12-10 Storage method of massive MD5 feature codes

Country Status (1)

Country Link
CN (1) CN114329465A (en)

Similar Documents

Publication Publication Date Title
US7792810B1 (en) Surrogate hashing
US10216848B2 (en) Method and system for recommending cloud websites based on terminal access statistics
US8868569B2 (en) Methods for detecting and removing duplicates in video search results
US10649997B2 (en) Method, system and computer program product for performing numeric searches related to biometric information, for finding a matching biometric identifier in a biometric database
US20090210412A1 (en) Method for searching and indexing data and a system for implementing same
JP2020182214A (en) Verification system and method for cooperation of blockchain and off-chain device
CN1531692A (en) Efficient collation element structure for handling large numbers of characters
CN111324750B (en) Large-scale text similarity calculation and text duplicate checking method
US8037069B2 (en) Membership checking of digital text
US20070174238A1 (en) Indexing and searching numeric ranges
EP2427834A1 (en) Method and system for search engine indexing and searching using the index
CN109271545A (en) A kind of characteristic key method and device, storage medium and computer equipment
CN113010477B (en) Method and device for retrieving metadata of persistent memory file system and storage structure
CN116126997B (en) Document deduplication storage method, system, device and storage medium
US20100205175A1 (en) Cap-sensitive text search for documents
CN114329465A (en) Storage method of massive MD5 feature codes
CN115294586A (en) Invoice identification method and device, storage medium and electronic equipment
CN115422125A (en) Electronic document automatic filing method and system based on intelligent algorithm
CN116450581B (en) Local quick matching method and system for white list and electronic equipment
CN113407375B (en) Database deleted data recovery method, device, equipment and storage medium
CN110347804B (en) Sensitive information detection method of linear time complexity
CN111274350B (en) Data processing method, device, computer equipment and storage medium
CN117493712A (en) PDF document navigable directory extraction method and device, electronic equipment and storage medium
US9189488B2 (en) Determination of landmarks
CN117632873A (en) File sorting method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination