CN109165220B

CN109165220B - Data matching calculation method

Info

Publication number: CN109165220B
Application number: CN201810903988.0A
Authority: CN
Inventors: 王方立
Original assignee: Tianjin Weinute Information Technology Co ltd
Current assignee: Tianjin Weinute Information Technology Co ltd
Priority date: 2018-08-09
Filing date: 2018-08-09
Publication date: 2021-06-22
Anticipated expiration: 2038-08-09
Also published as: CN109165220A

Abstract

The invention discloses a data matching calculation method. The method comprises the following steps: Step 1, loading rule data. Step 2, constructing a compiling function cf(). Step 3, allocating an array AI of size 2^I/8 bytes, which is used to mark the position of each index sequence value in the array. Step 4, calculating the index sequence value of each rule data through the compiling function cf(). Step 5, matching rule data. The advantage of the method is that the compiling function needs as few as log(N) bits of data; even if each piece of data is 4 G in size, the compiling function needs only 32 bits to take part in the operation, which greatly reduces the amount of calculation.

Description

Data matching calculation method

Technical Field

The invention relates to a data matching calculation method, and belongs to the technical field of industrial control.

Background

At present, as industrialization and informatization continue to converge, more and more information technologies are applied in the industrial field. Because general-purpose software, hardware and network facilities are widely adopted and integrated with enterprise management information systems, industrial control systems are becoming more and more open and now exchange data with intranets or the internet. As a result, industrial control security equipment, such as vulnerability mining tools and industrial control firewalls based on industrial white lists, has been introduced for industrial control equipment. Looking up data that matches a white list must have low latency, which places high demands on the efficiency of the matching algorithm.

Patent application with publication number CN107688605A discloses a cross-platform data matching method, which specifically comprises the following steps: receiving a data matching request sent by a terminal; acquiring group behavior data corresponding to a first user group from a first social network platform, and learning the group behavior data to obtain a group feature distribution function; acquiring the associated users of the appointed root node users and corresponding behavior data in a second social network platform; learning the behavior data of the root node user, and generating a group feature distribution function matched with the root node user; performing behavior learning on the behavior data of the associated user; calculating the maximum entropy value of the group characteristic distribution function after the associated users are matched, and determining the associated user corresponding to the maximum entropy value as the matched user of the first user group; and determining the next matching user by taking the determined matching user as the current root node user until the determined matching user meets the set quantity condition, and finishing the group matching.

Patent application publication No. CN107784057A discloses a data matching method and apparatus, the method includes: acquiring medical data stored in a database to be matched and preset word segmentation logic; performing word segmentation processing on the medical data in the standard database according to word segmentation logic to form standard medical data; performing word segmentation processing on the medical data in the database to be matched according to word segmentation logic to form medical data to be matched; matching standard medical data through the medical data to be matched; and when the medical data to be matched is matched with the standard medical data, establishing and storing the matching relation between the corresponding medical data in the database to be matched and the corresponding medical data in the standard database according to the matching relation between the medical data to be matched and the standard medical data. According to the data matching method and the data matching device, after the same word segmentation logic is adopted to segment the information of the standard database and the information of the database to be matched, the matching relation between the database to be matched and the standard database is automatically established.

In summary, conventional matching calculation methods query data with a hash operation. The drawback of the hash operation is that the size of the hash bucket determines how many collisions occur: allocating too large a hash bucket strains the memory resources of the system, and even then collisions cannot be completely eliminated.

Disclosure of Invention

The invention aims to provide a data matching calculation method capable of overcoming the technical problems, and the method comprises the following steps:

Step 1, loading rule data: the rule data are loaded into an array, the number of rule data is N, the length of each rule data is D bits, and the serial numbers of the rule data are 0, 1, 2, ..., N-1.

Step 2, constructing a compiling function cf ():

Step 2.1, determining the length I, in bits, of the index matching sequence according to the number N of rule data from step 1, where I = log(N), the logarithm being taken to base 2 because the sequence is binary. When I has a fractional part it is rounded up; for example, I = 7.102 becomes I = 8. Distinguishing N data in binary only requires finding the minimum number of bits that can tell the N data apart. I is the length of the matching sequence, and the matching sequence is realized in binary form, that is, as a binary data sequence of I bits.

Step 2.2, optimizing the index matching sequence. Optimization means searching the rule data loaded in step 1 for a data sequence, measured in bits, of length I bits, such that when the sequence is applied to every rule data loaded in step 1 the difference degree of the index sequence equals N. When the difference degree is smaller than N, I is increased by 1 and the difference degree is recalculated. The difference degree is defined as the number of data remaining after the N data are deduplicated.

Step 2.3, recording the index of the difference sequence within each rule data into an array A. The compiling function cf() obtains the index matching sequence from a combination of the entries of array A. By calculation, the optimal index matching sequence can be found after [formula missing in the source] permutations and combinations, where [formula missing in the source] denotes the number of cf() evaluations that must be performed to find the best matching sequence.
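The construction in step 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: rule data are modeled as Python integers of D bits, bit position 0 is assumed to be the least significant bit, and the names `min_index_bits`, `make_cf`, `difference_degree` and `find_positions` are hypothetical. The search is a plain brute-force enumeration of bit-position combinations, chosen for clarity; the source does not specify the search order.

```python
from itertools import combinations


def min_index_bits(n):
    """Smallest I such that 2**I >= n, i.e. log2(n) rounded up (at least 1)."""
    return max(1, (n - 1).bit_length())


def make_cf(positions):
    """Build cf(): concatenate the bits of a rule at the given positions."""
    def cf(rule):
        value = 0
        for pos in positions:
            value = (value << 1) | ((rule >> pos) & 1)
        return value
    return cf


def difference_degree(rules, cf):
    """Number of distinct index values left after deduplication."""
    return len({cf(rule) for rule in rules})


def find_positions(rules, d):
    """Search for bit positions (array A) whose index values are all distinct.

    Starts at I = ceil(log2(N)) and increases I by 1 whenever no combination
    of I positions reaches a difference degree of N.
    """
    n = len(rules)
    for i in range(min_index_bits(n), d + 1):
        for positions in combinations(range(d), i):
            if difference_degree(rules, make_cf(positions)) == n:
                return list(positions)
    raise ValueError("no set of bit positions distinguishes all rules")


# Four 8-bit rules that differ only in bits 1 and 6 (illustrative data).
rules = [0b00000000, 0b00000010, 0b01000000, 0b01000010]
A = find_positions(rules, d=8)
cf = make_cf(A)
```

With this illustrative data the search settles on `A == [1, 6]`, so cf() reads only 2 of the 8 bits of each rule, and the four rules map to the index values 0, 2, 1 and 3.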

Step 3, allocating an array AI of size 2^I/8 bytes, which is used to mark, bit by bit, the position of each index sequence value in the array; and allocating another array AN of size 2^I for holding the serial numbers of the rule data. AI needs only 2^I/8 bytes because each value is marked with a single bit, which saves memory space.

Step 4, calculating the index sequence value of each rule data through the compiling function cf(), setting the corresponding bit of array AI, and setting the value of AN[cf()] to the serial number of the current rule data. AN[cf()] stores the serial number of the rule data and is used to look up the rule data quickly.
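Steps 3 and 4 can be illustrated with a short Python sketch. It assumes that the cf() values of the rules have already been computed and are passed in as a list; the names `build_tables` and `is_marked`, and the bit order within each byte, are hypothetical choices made for the example rather than details taken from the source.

```python
def build_tables(index_values, i_bits):
    """Build the bitmap AI and the serial-number array AN.

    index_values[k] is the (assumed precomputed) cf() value of rule k.
    """
    size = 1 << i_bits                    # 2**I possible index values
    ai = bytearray((size + 7) // 8)       # one bit per value: 2**I / 8 bytes
    an = [0] * size                       # one serial-number slot per value
    for serial, value in enumerate(index_values):
        ai[value >> 3] |= 1 << (value & 7)    # mark the value as present
        an[value] = serial                    # remember which rule owns it
    return ai, an


def is_marked(ai, value):
    """Return True when the bit for this index value is set in AI."""
    return bool(ai[value >> 3] & (1 << (value & 7)))


# Three rules whose cf() values are assumed to be 5, 0 and 12, with I = 4.
ai, an = build_tables([5, 0, 12], i_bits=4)
```

Here AI occupies 2 bytes for 16 possible index values. Unused slots of AN hold 0, which is indistinguishable from serial number 0 on its own, so AI has to be consulted before AN is read.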

Step 5, rule data matching:

5.1, taking a piece of rule data input externally;

5.2, obtaining the value of the index matching sequence through the compiling function cf();

5.3, when the corresponding bit in array AI is 1, the index value is hit, indicating that the input rule data is present among the loaded rule data; otherwise, matching fails and matching ends;

5.4, when a hit is found in step 5.3, acquiring the serial number of the rule data through AN[cf()] and performing a memory comparison between the input rule data and the rule data whose serial number is AN[cf()]; if the two are completely identical, the match succeeds; otherwise the match fails.
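The matching procedure of step 5 can be sketched in Python as below. The setup is hypothetical: three 8-bit rules modeled as integers, a hand-written cf() that reads bits 1 and 6 (so I = 2), and tables AI and AN filled as in steps 3 and 4. The function name `match` and the return convention (serial number or `None`) are assumptions made for the example.

```python
def match(candidate, rules, cf, ai, an):
    """Return the serial number of the matching rule, or None on failure."""
    value = cf(candidate)                          # step 5.2
    if not ai[value >> 3] & (1 << (value & 7)):    # step 5.3: bit not set
        return None
    serial = an[value]                             # step 5.4: fetch serial number
    # Full comparison: equal index bits do not imply equal data.
    return serial if rules[serial] == candidate else None


# Hypothetical setup: 8-bit rules distinguished by bits 1 and 6 (I = 2).
rules = [0b00000000, 0b00000010, 0b01000000]


def cf(rule):
    return (((rule >> 1) & 1) << 1) | ((rule >> 6) & 1)


ai = bytearray(1)            # 2**2 bits fit in one byte
an = [0] * 4
for serial, rule in enumerate(rules):
    value = cf(rule)
    ai[value >> 3] |= 1 << (value & 7)
    an[value] = serial
```

In this setup `match(0b01000000, rules, cf, ai, an)` returns 2. The input `0b01000010` fails at step 5.3 because its index value 3 is not marked, and `0b00000011` passes step 5.3 but fails the memory comparison of step 5.4, which is why that final comparison is needed.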

The technical terms related to the present invention are explained as follows:

Hash function: a hash function (also translated as "hashing") transforms an input of arbitrary length into an output of fixed length by means of a hashing algorithm; the output is the hash value. Because the space of hash values is smaller than the space of inputs, different inputs may hash to the same output, so a unique input value cannot be determined from the hash value. In short, a hash function compresses a message of arbitrary length into a message digest of fixed length.

Hash collision and resolution: a hash collision occurs when the hash function produces identical hash values for different data. Collisions cannot be eliminated, because they are an inherent defect of hash calculation.

Hash bucket: a hash bucket is implemented with an array and linked lists. In a HashMap, the hash value is calculated from the hash code of the key, so equal hash codes always yield equal hash values. When many objects are stored, different objects may yield the same hash value, and a hash collision occurs. The hash table is also called a hash array; each element of the hash array is the head node of a singly linked list. The linked lists resolve collisions: keys that map to the same array position are placed in the same singly linked list.
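The bucket structure described above can be illustrated with a small Python sketch. It is a simplified model, not HashMap itself: the class name `ChainedHashTable` is hypothetical, and each chain is a Python list standing in for a singly linked list. The example relies on CPython hashing small integers to themselves.

```python
class ChainedHashTable:
    """Array of buckets; each bucket chains the entries that share an index."""

    def __init__(self, bucket_count=8):
        self.buckets = [[] for _ in range(bucket_count)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for pos, (existing, _) in enumerate(bucket):
            if existing == key:            # key already stored: overwrite
                bucket[pos] = (key, value)
                return
        bucket.append((key, value))        # new key, possibly a collision

    def get(self, key):
        for existing, value in self.buckets[self._index(key)]:
            if existing == key:
                return value
        raise KeyError(key)


table = ChainedHashTable(bucket_count=8)
table.put(3, "rule-a")
table.put(11, "rule-b")    # 11 % 8 == 3, so this collides with key 3
```

Both keys remain retrievable, but they share bucket 3, so a lookup has to walk the chain. This extra walk is the cost that the method of the invention aims to avoid.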

Linked list: a linked list is a storage structure that is non-contiguous and non-sequential in physical storage; the logical order of the data elements is given by the order in which the pointers link them. A linked list consists of a series of elements called nodes, which can be created dynamically at run time. Each node has two parts: a data field that stores the data element and a pointer field that stores the address of the next node. Because the nodes need not be stored in order, insertion can be done in O(1) time, but finding a node or accessing the node at a given position takes O(n) time; for a sequential (array-based) list, the corresponding time complexities are O(log n) and O(1), respectively.
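The difference between O(1) insertion and O(n) search can be seen in a minimal Python sketch of a singly linked list. The class and method names are hypothetical, and `find` returns the number of nodes visited purely to make the linear cost visible.

```python
class Node:
    """One list element: a data field plus a pointer to the next node."""

    def __init__(self, data, next_node=None):
        self.data = data
        self.next = next_node


class SinglyLinkedList:
    def __init__(self):
        self.head = None

    def push_front(self, data):
        """O(1): only the head pointer changes."""
        self.head = Node(data, self.head)

    def find(self, data):
        """O(n): walk the pointers; return nodes visited, or None if absent."""
        steps = 0
        node = self.head
        while node is not None:
            steps += 1
            if node.data == data:
                return steps
            node = node.next
        return None


items = SinglyLinkedList()
for value in (1, 2, 3):
    items.push_front(value)    # list is now 3 -> 2 -> 1
```

Finding the most recently inserted value visits one node, while finding the oldest value visits all three.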

The advantages of the method are as follows. The compiling function optimizes the matched index sequence, so it is adapted to the loaded rule data and never produces a conflict, which improves matching efficiency. In a traditional hash function all of the data must take part in the calculation, so the complexity of the hash function grows rapidly as the data grow. The compiling function of the method needs as few as log(N) bits of data; even if each piece of data is 4 G in size, the compiling function needs as few as 32 bits to take part in the calculation, which greatly reduces the amount of calculation.

Detailed Description

The following describes embodiments of the present invention in detail. The method comprises steps 1 to 5 as set out above in the Disclosure of Invention.

The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the present disclosure should be covered within the scope of the present invention claimed in the appended claims.

Claims

1. A data matching calculation method is characterized by comprising the following steps:

step 1, loading rule data, namely loading the rule data into an array, wherein the number of the rule data is N, the length of each rule data is D, the unit is bit, and the serial number of each data is 0,1,2.... N-1;

step 2, constructing a compiling function cf (), which comprises the following steps:

step 2.1, determining the length I, in bits, of the index matching sequence according to the number N of rule data in step 1, where I = log(N) with the logarithm taken to base 2, and when I has a fractional part it is rounded up, so that I = 7.102 becomes I = 8; distinguishing the N data in binary only requires finding the minimum number of bits capable of distinguishing the N data, wherein I represents the length of the matching sequence, and the matching sequence is realized in binary form, namely as a binary data sequence of I bits;

step 2.2, optimizing the index matching sequence, wherein optimizing the index matching sequence means searching the rule data loaded in step 1 for a data sequence, in units of bits, whose length is I bits, such that when the data sequence is applied to each rule data loaded in step 1 the difference degree of the index sequence is N; when the difference degree is smaller than N, I is increased by 1 and the difference degree is recalculated, the difference degree being defined as the number of data remaining after the N data are deduplicated;

step 2.3, recording the index of the difference sequence in each rule data into an array A, wherein the compiling function cf() obtains the index matching sequence through a combination of the entries of array A, and [formula missing in the source] is the number of cf() evaluations that need to be performed to find the optimal matching sequence;

step 3, allocating an array AI of size 2^I/8 bytes, which is used for marking, bit by bit, the position of each index sequence value in the array; and allocating another array AN of size 2^I for storing the serial numbers of the rule data, wherein AI needs only 2^I/8 bytes because each value is marked by one bit, which saves memory space;

step 4, calculating the index sequence value of each rule data through the compiling function cf(), setting the corresponding bit of array AI, and setting the value of AN[cf()] to the serial number of the current rule data, wherein AN[cf()] stores the serial number of the rule data and is used for quickly looking up the rule data;

and step 5, matching rule data, comprising the following steps:

step 5.1, taking a piece of rule data input externally;

step 5.2, obtaining the value of the index matching sequence through the compiling function cf();

step 5.3, when the corresponding bit in array AI is 1, the index value is hit, indicating that the input rule data is present among the loaded rule data; otherwise, matching fails and matching ends;

step 5.4, when a hit is found in step 5.3, acquiring the serial number of the rule data from AN[cf()] and performing a memory comparison between the input rule data and the rule data whose serial number is AN[cf()]; if the two are completely identical, the match succeeds; otherwise the match fails.