CN109165220B - Data matching calculation method - Google Patents
- Publication number
- CN109165220B (application CN201810903988.0A)
- Authority
- CN
- China
- Prior art keywords
- data
- sequence
- rule data
- matching
- index
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data matching calculation method comprising the following steps: step 1, loading rule data; step 2, constructing a compiling function cf(); step 3, allocating an array AI of size 2^I / 8 bytes, used to mark the position of each index sequence value in the array; step 4, calculating the index sequence value of each piece of rule data through the compiling function cf(); step 5, matching rule data. The advantage of the method is that the compiling function needs as few as log2(N) bits of data: even if each piece of data is 4G in size, the compiling function needs only 32 bits to take part in the operation, which greatly reduces the amount of calculation.
Description
Technical Field
The invention relates to a data matching calculation method, and belongs to the technical field of industrial control.
Background
At present, as industrialization and informatization continue to converge, more and more information technologies are applied in the industrial field. Because general-purpose software, hardware and network facilities are widely adopted and integrated with enterprise management information systems, industrial control systems are becoming increasingly open and exchange data with intranets or the Internet. As a result, industrial control security equipment, such as vulnerability-mining tools for industrial control equipment and industrial control firewalls based on industrial whitelists, has come into use. Looking up data against a whitelist requires low latency, which places high demands on the efficiency of the matching algorithm.
Patent application with publication number CN107688605A discloses a cross-platform data matching method, which specifically comprises the following steps: receiving a data matching request sent by a terminal; acquiring group behavior data corresponding to a first user group from a first social network platform, and learning the group behavior data to obtain a group feature distribution function; acquiring the associated users of the appointed root node users and corresponding behavior data in a second social network platform; learning the behavior data of the root node user, and generating a group feature distribution function matched with the root node user; performing behavior learning on the behavior data of the associated user; calculating the maximum entropy value of the group characteristic distribution function after the associated users are matched, and determining the associated user corresponding to the maximum entropy value as the matched user of the first user group; and determining the next matching user by taking the determined matching user as the current root node user until the determined matching user meets the set quantity condition, and finishing the group matching.
Patent application publication No. CN107784057A discloses a data matching method and apparatus, the method includes: acquiring medical data stored in a database to be matched and preset word segmentation logic; performing word segmentation processing on the medical data in the standard database according to word segmentation logic to form standard medical data; performing word segmentation processing on the medical data in the database to be matched according to word segmentation logic to form medical data to be matched; matching standard medical data through the medical data to be matched; and when the medical data to be matched is matched with the standard medical data, establishing and storing the matching relation between the corresponding medical data in the database to be matched and the corresponding medical data in the standard database according to the matching relation between the medical data to be matched and the standard medical data. According to the data matching method and the data matching device, after the same word segmentation logic is adopted to segment the information of the standard database and the information of the database to be matched, the matching relation between the database to be matched and the standard database is automatically established.
In summary, conventional matching calculation methods use hash operations to query data. The drawback of hashing is that the size of the hash bucket determines how many collisions occur: allocating too many buckets strains the memory resources of the system, and collisions still cannot be completely eliminated.
Disclosure of Invention
The invention aims to provide a data matching calculation method capable of overcoming the technical problems, and the method comprises the following steps:
Step 1, loading rule data: the rule data are loaded into an array; the number of rule data is N, the length of each piece of rule data is D bits, and the serial numbers of the data are 0, 1, 2, ..., N-1.
Step 2, constructing a compiling function cf ():
Step 2.1, determining the length I of the index matching sequence, in bits, from the number N of rule data in step 1: I = log2(N), rounded up when the result is fractional (for example, 7.102 becomes 8). Because the differences among the N data are found by binary calculation, only the minimum number of bits capable of distinguishing the N data needs to be found. I is the length of the matching sequence, which is realized in binary form, that is, as a binary data sequence of I bits.
Step 2.2, optimizing the index matching sequence: the rule data loaded in step 1 are searched for a data sequence, in units of bits, of length I such that, when the sequence is applied to each piece of rule data loaded in step 1, the degree of difference of the index sequence is N. When the degree of difference is smaller than N, I is increased by 1 and the degree of difference is recalculated. The degree of difference is defined as the number of data remaining after the N data are deduplicated.
Step 2.3, recording the index of the difference sequence within each piece of rule data into an array A. The compiling function cf() obtains the index matching sequence from a combination of the entries of array A. After a number of permutations and combinations (the expression for this number is missing from the source text), the optimal index matching sequence can be found; that number is how many times cf() must be evaluated to find the best matching sequence.
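Step 2 can be illustrated with a minimal sketch. It assumes the rule data are equal-length `bytes` objects, that bit positions are counted from the most significant bit of the first byte, and that the search is a plain exhaustive scan over combinations of bit positions; the names `get_bit`, `cf` (with an explicit `positions` argument) and `find_positions` are hypothetical and are not taken from the patent.

```python
from itertools import combinations
from math import ceil, log2


def get_bit(data: bytes, pos: int) -> int:
    """Return the bit at position pos, counting from the MSB of byte 0."""
    return (data[pos // 8] >> (7 - pos % 8)) & 1


def cf(data: bytes, positions: tuple) -> int:
    """Concatenate the bits of data at the chosen positions into one integer."""
    value = 0
    for pos in positions:
        value = (value << 1) | get_bit(data, pos)
    return value


def find_positions(rules: list) -> tuple:
    """Search for the fewest bit positions whose values differ across all rules."""
    n = len(rules)
    total_bits = len(rules[0]) * 8      # D, assuming equal-length rules
    i = max(1, ceil(log2(n)))           # step 2.1: I = log2(N), rounded up
    while i <= total_bits:
        for positions in combinations(range(total_bits), i):
            distinct = {cf(rule, positions) for rule in rules}
            if len(distinct) == n:      # degree of difference equals N
                return positions        # array A: the recorded bit indices
        i += 1                          # step 2.2: no I-bit choice works, so widen
    raise ValueError("rules are not pairwise distinct")
```

For the four one-byte rules `b"\x00"`, `b"\x01"`, `b"\x02"`, `b"\x03"`, only the two lowest bits differ, so the search settles on positions 6 and 7. The exhaustive scan grows combinatorially with D and I, so this is an illustration of the idea rather than a practical search strategy.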
Step 3, allocating an array AI of size 2^I / 8 bytes, used to mark, in units of bits, the position of each index sequence value in the array; allocating another array AN of size 2^I, used to hold the serial numbers of the rule data. The 2^I / 8 bytes of AI mark each index value with a single bit in order to save memory space.
Step 4, calculating the index sequence value of each piece of rule data through the compiling function cf(), setting the corresponding bit of array AI, and setting AN[cf()] to the serial number of the current rule data. AN[cf()] thus stores the serial number of the rule data and is used to look up the rule data quickly.
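Steps 3 and 4 might look as follows in code. This is a sketch under assumptions: `cf` here is a hypothetical stand-in that reads two fixed bit positions, AI is a `bytearray` used as a bitmap (rounded up to a whole byte when 2^I is smaller than 8), and AN is a plain list.

```python
def cf(data: bytes, positions=(6, 7)) -> int:
    """Hypothetical compiled function: concatenate the bits at fixed positions."""
    value = 0
    for pos in positions:
        bit = (data[pos // 8] >> (7 - pos % 8)) & 1
        value = (value << 1) | bit
    return value


def build_tables(rules, index_bits):
    """Allocate AI (bitmap) and AN (serial numbers), then fill both from the rules."""
    slots = 1 << index_bits             # 2^I possible index values
    ai = bytearray((slots + 7) // 8)    # step 3: one bit per index value
    an = [0] * slots                    # step 3: one serial number per index value
    for serial, rule in enumerate(rules):
        idx = cf(rule)                  # step 4: index sequence value of this rule
        ai[idx // 8] |= 1 << (idx % 8)  # mark the index value as occupied
        an[idx] = serial                # remember which rule owns the slot
    return ai, an
```

With the rules `b"\x12"`, `b"\x07"` and `b"\x01"` and `index_bits=2`, the index values are 2, 3 and 1, so bits 1 to 3 of the single AI byte are set and AN holds `[0, 2, 0, 1]`. Slot 0 of AN is unused; its content is only meaningful when the matching AI bit is set.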
Step 5, rule data matching:
5.1, taking a piece of rule data input externally;
5.2, obtaining the value of the index matching sequence by compiling a function cf ();
5.3, when the corresponding bit in the array AI is 1, the input rule data exists in the loaded rule data, and the matched index value is hit; otherwise, matching fails, and matching is finished;
5.4, when a hit is found in step 5.3, acquiring the serial number of the rule data through AN[cf()] and performing a memory comparison between the input rule data and the rule data whose index value is AN[cf()]; if the two are completely the same, the match succeeds; otherwise the match fails.
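The lookup in step 5 can be sketched as below. The tables are built inline so the example is self-contained; `cf` is a hypothetical stand-in that takes the two lowest bits of the last byte, and `match` is an illustrative name, not one used by the patent.

```python
def cf(data: bytes) -> int:
    """Hypothetical compiled function: the two lowest bits of the last byte."""
    return data[-1] & 0b11


rules = [b"\x12", b"\x07", b"\x01"]
ai = bytearray(1)                       # bitmap for 2^2 index values
an = [0] * 4                            # serial number per index value
for serial, rule in enumerate(rules):
    idx = cf(rule)
    ai[idx // 8] |= 1 << (idx % 8)
    an[idx] = serial


def match(data: bytes):
    """Return the serial number of the matching rule, or None on failure."""
    idx = cf(data)                                  # step 5.2
    if not ((ai[idx // 8] >> (idx % 8)) & 1):       # step 5.3: bit clear, no candidate
        return None
    serial = an[idx]                                # step 5.4: candidate rule
    return serial if rules[serial] == data else None
```

An input such as `b"\x0b"` shares its index value with `b"\x07"`, so the AI bit is set, but the memory comparison in step 5.4 rejects it. The bitmap test only filters out inputs whose index value is unused; the full comparison is what decides the match.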
The technical terms related to the present invention are explained as follows:
Hash function: also called a hashing function. It transforms an input of arbitrary length into an output of fixed length, the hash value, by means of a hashing algorithm. Because the space of hash values is smaller than the space of inputs, different inputs may hash to the same output, so a unique input value cannot be determined from the hash value. In short, a hash function compresses a message of arbitrary length into a message digest of some fixed length.
Hash collision and resolution: a hash collision occurs when the hash function produces identical hash values for different data. Collisions cannot be eliminated because they are an inherent defect of hash calculation.
Hash bucket: implemented with an array and linked lists. In a HashMap, the hash value is calculated from the hash code of the key, so equal hash codes always give equal hash values. When many objects are stored, different objects may yield the same hash value, and a hash collision occurs. The hash table is also called a hash array; each element of the array is the head node of a singly linked list, and the linked lists resolve collisions: when different keys map to the same array position, they are placed in that position's singly linked list.
Linked list: a storage structure that is non-contiguous and non-sequential in physical storage; the logical order of the data elements is realized by the order of the pointer links. A linked list consists of a series of elements called nodes, which can be generated dynamically at run time. Each node comprises two parts: a data field that stores the data element and a pointer field that stores the address of the next node. Because the nodes need not be stored in sequence, insertion can achieve O(1) complexity, but finding a node or accessing the node at a specific position takes O(n) time, whereas the corresponding time complexities for the linear table and the sequence table are O(log n) and O(1) respectively.
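The trade-off between bucket count and collisions described for hash buckets can be illustrated with a small sketch. It assumes integer keys, uses `key % bucket_count` as a stand-in hash function, and uses Python lists in place of singly linked lists; none of these choices come from the patent.

```python
def build_buckets(keys, bucket_count):
    """Place each key in the bucket chosen by a simple modulo hash."""
    buckets = [[] for _ in range(bucket_count)]
    for key in keys:
        buckets[key % bucket_count].append(key)   # chain keys that share a bucket
    return buckets


def collisions(buckets):
    """Count keys that landed in a bucket already holding another key."""
    return sum(len(chain) - 1 for chain in buckets if len(chain) > 1)


keys = [3, 11, 19, 4, 12, 7]
small = build_buckets(keys, 8)      # few buckets, long chains
large = build_buckets(keys, 64)     # many buckets, more memory
```

With 8 buckets the keys 3, 11 and 19 share one chain and 4 and 12 share another, giving 3 collisions; with 64 buckets these particular keys do not collide, at the cost of eight times as many bucket slots. Other key sets can still collide in the larger table.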
The advantages of the method are as follows. The compiling function optimizes the matching index sequence, is intelligent, and never produces a conflict, which improves matching efficiency. In a traditional hashing function all of the data must take part in the calculation, so the cost of the hashing function grows rapidly as the data grow, whereas the compiling function of this method needs as few as log2(N) bits of data. Even if each piece of data is 4G in size, the compiling function needs as few as 32 bits to take part in the calculation, which greatly reduces the amount of calculation.
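The arithmetic behind these figures can be checked with a short sketch. It computes the minimum index length I = log2(N), rounded up, together with the sizes of AI and AN from step 3; the function names are hypothetical, and integer `bit_length` is used instead of floating-point `log2` to avoid rounding issues for large N.

```python
def index_bits(n: int) -> int:
    """Minimum I such that 2^I >= n, with at least one bit."""
    return max(1, (n - 1).bit_length())


def table_sizes(n: int):
    """Return (I, bytes needed for bitmap AI, entries needed for AN)."""
    i = index_bits(n)
    slots = 1 << i
    return i, (slots + 7) // 8, slots
```

For example, `table_sizes(137)` gives an 8-bit index, a 32-byte AI and a 256-entry AN. At N = 2^32 the minimum index length is 32 bits, and AI alone occupies 2^29 bytes (512 MiB), so the tables themselves become the dominant memory cost at that scale.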
Detailed Description
The following describes embodiments of the present invention in detail. The method comprises the following steps:
Step 1, loading rule data: the rule data are loaded into an array; the number of rule data is N, the length of each piece of rule data is D bits, and the serial numbers of the data are 0, 1, 2, ..., N-1.
Step 2, constructing a compiling function cf ():
Step 2.1, determining the length I of the index matching sequence, in bits, from the number N of rule data in step 1: I = log2(N), rounded up when the result is fractional (for example, 7.102 becomes 8). Because the differences among the N data are found by binary calculation, only the minimum number of bits capable of distinguishing the N data needs to be found. I is the length of the matching sequence, which is realized in binary form, that is, as a binary data sequence of I bits.
Step 2.2, optimizing the index matching sequence: the rule data loaded in step 1 are searched for a data sequence, in units of bits, of length I such that, when the sequence is applied to each piece of rule data loaded in step 1, the degree of difference of the index sequence is N. When the degree of difference is smaller than N, I is increased by 1 and the degree of difference is recalculated. The degree of difference is defined as the number of data remaining after the N data are deduplicated.
Step 2.3, recording the index of the difference sequence within each piece of rule data into an array A. The compiling function cf() obtains the index matching sequence from a combination of the entries of array A. After a number of permutations and combinations (the expression for this number is missing from the source text), the optimal index matching sequence can be found; that number is how many times cf() must be evaluated to find the best matching sequence.
Step 3, allocating an array AI of size 2^I / 8 bytes, used to mark, in units of bits, the position of each index sequence value in the array; allocating another array AN of size 2^I, used to hold the serial numbers of the rule data. The 2^I / 8 bytes of AI mark each index value with a single bit in order to save memory space.
Step 4, calculating the index sequence value of each piece of rule data through the compiling function cf(), setting the corresponding bit of array AI, and setting AN[cf()] to the serial number of the current rule data. AN[cf()] thus stores the serial number of the rule data and is used to look up the rule data quickly.
Step 5, rule data matching:
5.1, taking a piece of rule data input externally;
5.2, obtaining the value of the index matching sequence by compiling a function cf ();
5.3, when the corresponding bit in the array AI is 1, the input rule data exists in the loaded rule data, and the matched index value is hit; otherwise, matching fails, and matching is finished;
5.4, when hit is found in the step 5.3, acquiring the serial number of the rule data through AN [ cf () ], performing memory comparison on the input rule data and the rule data with the index value AN [ cf () ], and if the rule data are completely the same, successfully matching; otherwise the match fails.
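The five steps of the embodiment can be traced end to end on a toy rule set. The sketch below is an assumption-laden illustration: rules are represented as small integers of a fixed bit width, bit positions are counted from the least significant bit, the position search is exhaustive, and the class and method names (`RuleMatcher`, `extract`, `match`) are hypothetical.

```python
from itertools import combinations


def extract(value: int, positions: tuple) -> int:
    """Concatenate the bits of value at the given positions (0 = least significant)."""
    out = 0
    for pos in positions:
        out = (out << 1) | ((value >> pos) & 1)
    return out


class RuleMatcher:
    """Sketch of steps 1 to 5 for rules given as fixed-width integers."""

    def __init__(self, rules, width):
        self.rules = list(rules)                        # step 1: serial number = list index
        self.positions = self._choose_positions(width)  # step 2: build cf()
        slots = 1 << len(self.positions)
        self.ai = bytearray((slots + 7) // 8)           # step 3: bitmap AI
        self.an = [0] * slots                           # step 3: serial numbers AN
        for serial, rule in enumerate(self.rules):      # step 4: fill AI and AN
            idx = self.cf(rule)
            self.ai[idx // 8] |= 1 << (idx % 8)
            self.an[idx] = serial

    def _choose_positions(self, width):
        n = len(self.rules)
        i = max(1, (n - 1).bit_length())                # I = log2(N), rounded up
        while i <= width:
            for positions in combinations(range(width), i):
                if len({extract(r, positions) for r in self.rules}) == n:
                    return positions
            i += 1                                      # widen when no I-bit choice works
        raise ValueError("rules must be pairwise distinct")

    def cf(self, value):
        return extract(value, self.positions)

    def match(self, value):
        idx = self.cf(value)                            # step 5.2
        if not ((self.ai[idx // 8] >> (idx % 8)) & 1):  # step 5.3
            return None
        serial = self.an[idx]                           # step 5.4
        return serial if self.rules[serial] == value else None


matcher = RuleMatcher([0b1010, 0b0110, 0b0011], width=4)
```

For these three rules the first candidate positions (0, 1) fail because two rules share the same bits there, and the search settles on (0, 2). `matcher.match(0b0110)` then returns serial number 1, while `0b1111` misses in AI and `0b1110` hits in AI but fails the final comparison.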
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the present disclosure should be covered within the scope of the present invention claimed in the appended claims.
Claims (1)
1. A data matching calculation method is characterized by comprising the following steps:
step 1, loading rule data, namely loading the rule data into an array, wherein the number of the rule data is N, the length of each rule data is D, the unit is bit, and the serial number of each data is 0, 1, 2, ..., N-1;
step 2, constructing a compiling function cf (), which comprises the following steps:
step 2.1, determining the length I of the index matching sequence, in bits, from the number N of rule data in step 1, wherein I = log2(N), rounded up when the result is fractional (for example, 7.102 becomes 8); because the differences among the N data are found by binary calculation, only the minimum number of bits capable of distinguishing the N data needs to be found, wherein I is the length of the matching sequence and the matching sequence is realized in binary form, that is, as a binary data sequence of I bits;
step 2.2, optimizing the index matching sequence, wherein the rule data loaded in step 1 are searched for a data sequence, in units of bits, of length I such that, when the sequence is applied to each piece of rule data loaded in step 1, the degree of difference of the index sequence is N; when the degree of difference is smaller than N, I is increased by 1 and the degree of difference is recalculated; the degree of difference is defined as the number of data remaining after the N data are deduplicated;
step 2.3, recording the index of the difference sequence within each piece of rule data into an array A, wherein the compiling function cf() obtains the index matching sequence from a combination of the entries of array A; after a number of permutations and combinations (the expression for this number is missing from the source text), the optimal index matching sequence can be found, that number being how many times cf() must be evaluated to find the best matching sequence;
step 3, allocating an array AI of size 2^I / 8 bytes, used to mark, in units of bits, the position of each index sequence value in the array; allocating another array AN of size 2^I, used to store the serial numbers of the rule data, wherein the 2^I / 8 bytes of AI mark each index value with a single bit in order to save memory space;
step 4, calculating the index sequence value of each piece of rule data through the compiling function cf(), setting the corresponding bit of array AI, and setting AN[cf()] to the serial number of the current rule data, wherein AN[cf()] stores the serial number of the rule data and is used to look up the rule data quickly;
and step 5, matching rule data, comprising the following steps:
step 5.1, taking a piece of rule data input externally;
step 5.2, obtaining the value of the index matching sequence by compiling a function cf ();
step 5.3, when the corresponding bit in the array AI is 1, the input rule data exists in the loaded rule data, and the matched index value is hit; otherwise, matching fails, and matching is finished;
step 5.4, when hit is found in the step 5.3, acquiring the serial number of the rule data from the AN [ cf () ], performing memory comparison on the input rule data and the rule data with the index value of AN [ cf () ], and if the rule data are completely the same, successfully matching; otherwise the match fails.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810903988.0A CN109165220B (en) | 2018-08-09 | 2018-08-09 | Data matching calculation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109165220A (en) | 2019-01-08 |
CN109165220B (en) | 2021-06-22 |
Family
ID=64895400
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810903988.0A Active CN109165220B (en) | 2018-08-09 | 2018-08-09 | Data matching calculation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109165220B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101860531A (en) * | 2010-04-21 | 2010-10-13 | 北京星网锐捷网络技术有限公司 | Filtering rule matching method of data packet and device thereof |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104504038A (en) * | 2014-12-15 | 2015-04-08 | 北京更快互联网技术有限公司 | Hash search method for reducing hash collision |
CN104965687B (en) * | 2015-06-04 | 2017-12-08 | 北京东方国信科技股份有限公司 | Big data processing method and processing device based on instruction set generation |
US10169011B2 (en) * | 2016-10-24 | 2019-01-01 | International Business Machines Corporation | Comparisons in function pointer localization |
CN107291858B (en) * | 2017-06-09 | 2021-06-08 | 成都索贝数码科技股份有限公司 | Data indexing method based on character string suffix |
Also Published As
Publication number | Publication date |
---|---|
CN109165220A (en) | 2019-01-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108255958B (en) | Data query method, device and storage medium | |
US10938961B1 (en) | Systems and methods for data deduplication by generating similarity metrics using sketch computation | |
US9104676B2 (en) | Hash algorithm-based data storage method and system | |
US7885967B2 (en) | Management of large dynamic tables | |
US10140351B2 (en) | Method and apparatus for processing database data in distributed database system | |
EP3292481B1 (en) | Method, system and computer program product for performing numeric searches | |
CN107704202B (en) | Method and device for quickly reading and writing data | |
EP2924594A1 (en) | Data encoding and corresponding data structure in a column-store database | |
CN111221840B (en) | Data processing method and device, data caching method, storage medium and system | |
US11995050B2 (en) | Systems and methods for sketch computation | |
CN111045988B (en) | File searching method, device and computer program product | |
CN109460406B (en) | Data processing method and device | |
CN115964002B (en) | Electric energy meter terminal archive management method, device, equipment and medium | |
CN111209341B (en) | Data storage method, device, equipment and medium of block chain | |
WO2021127245A1 (en) | Systems and methods for sketch computation | |
US20210191640A1 (en) | Systems and methods for data segment processing | |
CN111651695A (en) | Method and device for generating and analyzing short link | |
CN112380004B (en) | Memory management method, memory management device, computer readable storage medium and electronic equipment | |
CN109165220B (en) | Data matching calculation method | |
CN116842012A (en) | Method, device, equipment and storage medium for storing Redis cluster in fragments | |
CN111414527A (en) | Similar item query method and device and storage medium | |
CN116089527A (en) | Data verification method, storage medium and device | |
CN113285933A (en) | User access control method and device, electronic equipment and storage medium | |
CN108984780B (en) | Method and device for managing disk data based on data structure supporting repeated key value tree | |
CN109947775B (en) | Data processing method and device, electronic equipment and computer readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |