CN109165220B - Data matching calculation method - Google Patents

Info

Publication number
CN109165220B
Authority
CN
China
Prior art keywords
data
sequence
rule data
matching
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810903988.0A
Other languages
Chinese (zh)
Other versions
CN109165220A (en)
Inventor
王方立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Weinute Information Technology Co ltd
Original Assignee
Tianjin Weinute Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Weinute Information Technology Co ltd
Priority to CN201810903988.0A
Publication of CN109165220A
Application granted
Publication of CN109165220B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data matching calculation method. The method comprises the following steps: step 1, loading rule data; step 2, constructing a compiling function cf(); step 3, allocating an array AI of size 2^I/8 bytes, used to mark the position of each index sequence value in the array; step 4, calculating an index sequence value for each rule data through the compiling function cf(); step 5, matching rule data. The method has the following advantage: its compiling function needs as few as log(N) bits of data, and even if each piece of data is 4G in size the compiling function only needs 32 bits to participate in the operation, which greatly reduces the amount of calculation.

Description

Data matching calculation method
Technical Field
The invention relates to a data matching calculation method, and belongs to the technical field of industrial control.
Background
At present, as industrialization and informatization continue to merge, more and more information technologies are applied in the industrial field. With the widespread adoption of general-purpose software, hardware and network facilities, and with integration into enterprise management information systems, industrial control systems are becoming increasingly open and exchange data with intranets or the internet. As a result, industrial control security equipment, such as vulnerability mining tools and whitelist-based industrial control firewalls, is being deployed for industrial control equipment. Looking up data that matches a whitelist must have low latency, which places high demands on the efficiency of the matching algorithm.
Patent application with publication number CN107688605A discloses a cross-platform data matching method, which specifically comprises the following steps: receiving a data matching request sent by a terminal; acquiring group behavior data corresponding to a first user group from a first social network platform, and learning the group behavior data to obtain a group feature distribution function; acquiring the associated users of the appointed root node users and corresponding behavior data in a second social network platform; learning the behavior data of the root node user, and generating a group feature distribution function matched with the root node user; performing behavior learning on the behavior data of the associated user; calculating the maximum entropy value of the group characteristic distribution function after the associated users are matched, and determining the associated user corresponding to the maximum entropy value as the matched user of the first user group; and determining the next matching user by taking the determined matching user as the current root node user until the determined matching user meets the set quantity condition, and finishing the group matching.
Patent application publication No. CN107784057A discloses a data matching method and apparatus, the method includes: acquiring medical data stored in a database to be matched and preset word segmentation logic; performing word segmentation processing on the medical data in the standard database according to word segmentation logic to form standard medical data; performing word segmentation processing on the medical data in the database to be matched according to word segmentation logic to form medical data to be matched; matching standard medical data through the medical data to be matched; and when the medical data to be matched is matched with the standard medical data, establishing and storing the matching relation between the corresponding medical data in the database to be matched and the corresponding medical data in the standard database according to the matching relation between the medical data to be matched and the standard medical data. According to the data matching method and the data matching device, after the same word segmentation logic is adopted to segment the information of the standard database and the information of the database to be matched, the matching relation between the database to be matched and the standard database is automatically established.
In summary, conventional matching calculation methods use hash operations to query data. Hashing has the drawback that the size of the hash bucket array determines how many collisions occur: allocating too many buckets strains the memory resources of the system, and collisions still cannot be completely eliminated.
Disclosure of Invention
The invention aims to provide a data matching calculation method capable of overcoming the technical problems, and the method comprises the following steps:
Step 1, loading rule data, wherein the rule data are loaded into an array, the number of the rule data is N, the length of each rule data is D, in bits, and the serial numbers of the data are 0, 1, 2, ..., N-1.
Step 2, constructing a compiling function cf():
Step 2.1, determining the length I of the index matching sequence, in bits, from the number N of the rule data in step 1: I = log2(N), rounded up when it has a fractional part (for example, 7.102 becomes 8). To tell N pieces of data apart in binary, only the minimum number of bits able to distinguish the N pieces of data needs to be found. I is the length of the matching sequence, which is realized in binary form, that is, as a binary data sequence of I bits.
Step 2.2, optimizing the index matching sequence, which means searching the rule data loaded in step 1 for a data sequence, measured in bits and I bits long, such that when it is applied to every rule data loaded in step 1 the difference degree of the index sequence is N. When the difference degree is smaller than N, I is increased by 1 and the difference degree is recalculated. The difference degree is defined as the number of pieces of data remaining after the N pieces of data are deduplicated.
Step 2.3, recording the index of the difference sequence within each rule data into an array A; the compiling function cf() obtains the index matching sequence by combining the entries of array A. After performing
Figure BDA0001760164430000021
permutations and combinations, the optimal index matching sequence can be found, where
Figure BDA0001760164430000022
is the number of cf() function evaluations that need to be performed to find the best matching sequence.
Step 3, allocating an array AI of size 2^I/8 bytes, used to mark the position of each index sequence value in the array, in units of bits; allocating another array AN of size 2^I, used to store the serial numbers of the rule data. AI occupies only 2^I/8 bytes because each mark is a single bit, which saves memory space.
Step 4, calculating the index sequence value of each rule data through the compiling function cf(), setting the corresponding bit of array AI, and setting the value of AN[cf()] to the serial number of the current rule data; AN[cf()] stores the serial number of the rule data and is used to look up the rule data quickly.
Step 5, rule data matching:
5.1, taking a piece of rule data input externally;
5.2, obtaining the value of the index matching sequence by compiling a function cf ();
5.3, when the corresponding bit in the array AI is 1, the input rule data exists in the loaded rule data, and the matched index value is hit; otherwise, matching fails, and matching is finished;
5.4, when hit is found in the step 5.3, acquiring the serial number of the rule data through AN [ cf () ], performing memory comparison on the input rule data and the rule data with the index value AN [ cf () ], and if the rule data are completely the same, successfully matching; otherwise the match fails.
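Step 2 can be illustrated with a minimal sketch. It assumes that every rule has the same byte length and that the rules are pairwise distinct; the helper names `index_bits`, `bit_at`, `make_cf` and `build_cf`, the MSB-first bit numbering, and the brute-force search order are illustrative assumptions, not details taken from the patent.

```python
from itertools import combinations


def index_bits(n: int) -> int:
    """Smallest I with 2**I >= n, i.e. log2(n) rounded up (at least 1)."""
    return max(1, (n - 1).bit_length())


def bit_at(rule: bytes, pos: int) -> int:
    """Bit `pos` of `rule`, counted from the most significant bit of byte 0."""
    return (rule[pos // 8] >> (7 - pos % 8)) & 1


def make_cf(positions):
    """Return cf(): packs the bits found at `positions` into an integer."""
    def cf(rule: bytes) -> int:
        value = 0
        for pos in positions:
            value = (value << 1) | bit_at(rule, pos)
        return value
    return cf


def build_cf(rules):
    """Search for bit positions (array A) whose values differ for every rule."""
    n = len(rules)
    d = len(rules[0]) * 8              # D: rule length in bits
    i = index_bits(n)                  # step 2.1
    while i <= d:
        for positions in combinations(range(d), i):
            cf = make_cf(positions)
            # Difference degree: number of distinct index values (step 2.2).
            if len({cf(r) for r in rules}) == n:
                return positions, cf
        i += 1                         # no I-bit selection separates the rules
    raise ValueError("rules are not pairwise distinct")


rules = [b"\x10\x01", b"\x10\x02", b"\x20\x01", b"\x20\x03"]
A, cf = build_cf(rules)
print(A, [cf(r) for r in rules])       # (2, 14) [0, 1, 2, 3]
```

For these four 16-bit rules the search settles on bit positions 2 and 14, so cf() reads only two bits of each rule. An exhaustive search like this grows combinatorially with D and is only meant to make the notion of difference degree concrete.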
The technical terms related to the present invention are explained as follows:
Hash function: "hash" is rendered in Chinese either by meaning or by direct transliteration. A hashing algorithm transforms an input of arbitrary length into an output of fixed length, and the output is the hash value. Because the space of hash values is smaller than the space of inputs, different inputs can hash to the same output, so a unique input value cannot be determined from the hash value. A hash function is thus a function that compresses a message of arbitrary length into a message digest of some fixed length.
Hash collision and resolution: a hash collision occurs when the hash function produces identical hash values for different data. Collisions cannot be eliminated, because they are an inherent defect of hash calculation.
Hash bucket: implemented with an array and linked lists. In a HashMap the hash value is calculated from the hash code of the key, so equal hash codes always give equal hash values. When many objects are stored, different objects may produce the same hash value, and a hash collision occurs. The hash table is also called a hash array; each element of the hash array is the head node of a singly linked list. The linked lists resolve conflicts: when different keys map to the same position of the array, they are placed in the singly linked list at that position.
Linked list: a storage structure that is non-contiguous and non-sequential on the physical storage unit; the logical order of the data elements is realized by the order of the pointer links in the list. A linked list is made up of a series of elements, each called a node, which can be generated dynamically at run time. Each node comprises two parts: a data field that stores the data element, and a pointer field that stores the address of the next node. Because the nodes are not stored in sequence, insertion can reach O(1) complexity, but searching for a node or accessing the node with a specific number takes O(n) time, whereas the corresponding time complexities of the linear table and the sequence table are O(log n) and O(1) respectively.
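The hash bucket structure described in these terms can be sketched as follows. This is an illustrative toy, not code from the patent: the class names `Node` and `ChainedHashTable` and the byte-sum hash are assumptions chosen so that a collision is easy to provoke.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: bytes                        # data field
    value: int
    next: Optional["Node"] = None     # pointer field: next node in the chain


class ChainedHashTable:
    def __init__(self, bucket_count: int):
        # Hash array: each element is the head node of a singly linked list.
        self.heads = [None] * bucket_count

    def _slot(self, key: bytes) -> int:
        # Toy hash (byte sum modulo bucket count), assumed for illustration.
        return sum(key) % len(self.heads)

    def insert(self, key: bytes, value: int) -> None:
        slot = self._slot(key)
        node = self.heads[slot]
        while node is not None:       # update in place if the key exists
            if node.key == key:
                node.value = value
                return
            node = node.next
        self.heads[slot] = Node(key, value, self.heads[slot])  # O(1) prepend

    def lookup(self, key: bytes) -> Optional[int]:
        node = self.heads[self._slot(key)]
        while node is not None:       # walk the chain: O(chain length)
            if node.key == key:
                return node.value
            node = node.next
        return None

    def chain_length(self, slot: int) -> int:
        count, node = 0, self.heads[slot]
        while node is not None:
            count, node = count + 1, node.next
        return count


table = ChainedHashTable(bucket_count=4)
table.insert(b"\x01\x04", 0)    # byte sum 5 -> slot 1
table.insert(b"\x04\x01", 1)    # byte sum 5 -> slot 1 as well: a collision
print(table.lookup(b"\x04\x01"), table.chain_length(1))   # 1 2
```

Both keys land in slot 1, so a lookup has to walk a chain of length 2. Adding buckets shortens the chains at the cost of memory, which is the trade-off the background section criticizes.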
The method of the invention has the following advantages: its compiling function optimizes the matched index sequence, is intelligent, and never generates conflicts, which improves matching efficiency. In a traditional hash function all of the data must participate in the calculation, so the complexity of the hash function grows rapidly as the data grows. The compiling function of this method needs as few as log(N) bits of data: even if each piece of data is 4G in size, as few as 32 bits need to participate in the operation, which greatly reduces the amount of calculation.
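As a rough illustration of the sizes involved, the sketch below computes I and the sizes of the arrays AI and AN for a given rule count N. The function name `table_sizes` is hypothetical, and rounding AI up to a whole byte when 2^I < 8 is an assumption the patent does not address.

```python
def table_sizes(n: int):
    """Return (I, AI size in bytes, AN size in slots) for n rules."""
    i = max(1, (n - 1).bit_length())   # I = log2(n) rounded up
    ai_bytes = (2**i + 7) // 8         # one bit per possible index value
    an_slots = 2**i                    # one serial-number slot per index value
    return i, ai_bytes, an_slots


print(table_sizes(137))    # (8, 32, 256)
print(table_sizes(1000))   # (10, 128, 1024)
```

With 137 rules the index is 8 bits wide, the bitmap AI takes 32 bytes, and AN has 256 slots. These figures assume step 2.2 did not need to increase I.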
Detailed Description
The following describes embodiments of the present invention in detail. The method comprises the following steps:
Step 1, loading rule data, wherein the rule data are loaded into an array, the number of the rule data is N, the length of each rule data is D, in bits, and the serial numbers of the data are 0, 1, 2, ..., N-1.
Step 2, constructing a compiling function cf():
Step 2.1, determining the length I of the index matching sequence, in bits, from the number N of the rule data in step 1: I = log2(N), rounded up when it has a fractional part (for example, 7.102 becomes 8). To tell N pieces of data apart in binary, only the minimum number of bits able to distinguish the N pieces of data needs to be found. I is the length of the matching sequence, which is realized in binary form, that is, as a binary data sequence of I bits.
Step 2.2, optimizing the index matching sequence, which means searching the rule data loaded in step 1 for a data sequence, measured in bits and I bits long, such that when it is applied to every rule data loaded in step 1 the difference degree of the index sequence is N. When the difference degree is smaller than N, I is increased by 1 and the difference degree is recalculated. The difference degree is defined as the number of pieces of data remaining after the N pieces of data are deduplicated.
Step 2.3, recording the index of the difference sequence within each rule data into an array A; the compiling function cf() obtains the index matching sequence by combining the entries of array A. After performing
Figure BDA0001760164430000041
permutations and combinations, the optimal index matching sequence can be found, where
Figure BDA0001760164430000042
is the number of cf() function evaluations that need to be performed to find the best matching sequence.
Step 3, allocating an array AI of size 2^I/8 bytes, used to mark the position of each index sequence value in the array, in units of bits; allocating another array AN of size 2^I, used to store the serial numbers of the rule data. AI occupies only 2^I/8 bytes because each mark is a single bit, which saves memory space.
Step 4, calculating the index sequence value of each rule data through the compiling function cf(), setting the corresponding bit of array AI, and setting the value of AN[cf()] to the serial number of the current rule data; AN[cf()] stores the serial number of the rule data and is used to look up the rule data quickly.
Step 5, rule data matching:
5.1, taking a piece of rule data input externally;
5.2, obtaining the value of the index matching sequence by compiling a function cf ();
5.3, when the corresponding bit in the array AI is 1, the input rule data exists in the loaded rule data, and the matched index value is hit; otherwise, matching fails, and matching is finished;
5.4, when hit is found in the step 5.3, acquiring the serial number of the rule data through AN [ cf () ], performing memory comparison on the input rule data and the rule data with the index value AN [ cf () ], and if the rule data are completely the same, successfully matching; otherwise the match fails.
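Steps 3 to 5 of this embodiment can be sketched as below. The stand-in `cf()` reads two fixed bit positions (bits 2 and 14, MSB-first) that are assumed to have been found in step 2; those positions, the sample rules, and the `match` helper are hypothetical and only serve to show how AI, AN and the final memory comparison interact.

```python
def cf(rule: bytes) -> int:
    """Stand-in compiled function: reads I = 2 bits at assumed positions."""
    hi = (rule[0] >> 5) & 1        # bit 2 of the rule (MSB-first numbering)
    lo = (rule[1] >> 1) & 1        # bit 14 of the rule
    return (hi << 1) | lo


I = 2
rules = [b"\x10\x01", b"\x10\x02", b"\x20\x01"]   # N = 3 loaded rules

# Step 3: AI is a bitmap of 2**I bits (rounded up to whole bytes here);
# AN has 2**I slots holding rule serial numbers.
AI = bytearray((2**I + 7) // 8)
AN = [0] * (2**I)

# Step 4: mark each rule's index value in AI and record its serial number.
for serial, rule in enumerate(rules):
    idx = cf(rule)
    AI[idx // 8] |= 1 << (idx % 8)
    AN[idx] = serial


def match(data: bytes):
    """Step 5: return the serial number of the matching rule, or None."""
    idx = cf(data)                                # 5.2
    if not (AI[idx // 8] >> (idx % 8)) & 1:       # 5.3: bit clear, no hit
        return None
    serial = AN[idx]                              # 5.4: fetch serial number
    return serial if rules[serial] == data else None


print(match(b"\x20\x01"), match(b"\x20\x03"), match(b"\x30\x01"))
# 2 None None
```

The third lookup shows why step 5.4 still compares memory: `b"\x30\x01"` produces the same index value as rule 2, so the bitmap reports a hit, and only the full comparison rejects it.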
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the present disclosure should be covered within the scope of the present invention claimed in the appended claims.

Claims (1)

1. A data matching calculation method is characterized by comprising the following steps:
step 1, loading rule data, namely loading the rule data into an array, wherein the number of the rule data is N, the length of each rule data is D, in bits, and the serial numbers of the data are 0, 1, 2, ..., N-1;
step 2, constructing a compiling function cf(), which comprises the following steps:
step 2.1, determining the length I of the index matching sequence, in bits, from the number N of the rule data in step 1: I = log2(N), rounded up when it has a fractional part (for example, 7.102 becomes 8); to tell N pieces of data apart in binary, only the minimum number of bits able to distinguish the N pieces of data needs to be found, wherein I is the length of the matching sequence, which is realized in binary form, that is, as a binary data sequence of I bits;
step 2.2, optimizing the index matching sequence, which means searching the rule data loaded in step 1 for a data sequence, measured in bits and I bits long, such that when it is applied to every rule data loaded in step 1 the difference degree of the index sequence is N; when the difference degree is smaller than N, I is increased by 1 and the difference degree is recalculated, the difference degree being defined as the number of pieces of data remaining after the N pieces of data are deduplicated;
step 2.3, recording the index of the difference sequence within each rule data into an array A, wherein the compiling function cf() obtains the index matching sequence by combining the entries of array A; after performing
Figure FDA0003001172410000011
permutations and combinations, the optimal index matching sequence can be found, where
Figure FDA0003001172410000012
is the number of cf() function evaluations that need to be performed to find the optimal matching sequence;
step 3, allocating an array AI of size 2^I/8 bytes, used to mark the position of each index sequence value in the array, in units of bits; allocating another array AN of size 2^I, used to store the serial numbers of the rule data, wherein AI occupies only 2^I/8 bytes because each mark is a single bit, which saves memory space;
step 4, calculating the index sequence value of each rule data through the compiling function cf(), setting the corresponding bit of array AI, and setting the value of AN[cf()] to the serial number of the current rule data, wherein AN[cf()] stores the serial number of the rule data and is used to look up the rule data quickly;
and step 5, matching rule data, comprising the following steps:
step 5.1, taking a piece of rule data input externally;
step 5.2, obtaining the value of the index matching sequence by compiling a function cf ();
step 5.3, when the corresponding bit in the array AI is 1, the input rule data exists in the loaded rule data, and the matched index value is hit; otherwise, matching fails, and matching is finished;
step 5.4, when hit is found in the step 5.3, acquiring the serial number of the rule data from the AN [ cf () ], performing memory comparison on the input rule data and the rule data with the index value of AN [ cf () ], and if the rule data are completely the same, successfully matching; otherwise the match fails.
CN201810903988.0A 2018-08-09 2018-08-09 Data matching calculation method Active CN109165220B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810903988.0A CN109165220B (en) 2018-08-09 2018-08-09 Data matching calculation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810903988.0A CN109165220B (en) 2018-08-09 2018-08-09 Data matching calculation method

Publications (2)

Publication Number Publication Date
CN109165220A CN109165220A (en) 2019-01-08
CN109165220B true CN109165220B (en) 2021-06-22

Family

ID=64895400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810903988.0A Active CN109165220B (en) 2018-08-09 2018-08-09 Data matching calculation method

Country Status (1)

Country Link
CN (1) CN109165220B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101860531A (en) * 2010-04-21 2010-10-13 北京星网锐捷网络技术有限公司 Filtering rule matching method of data packet and device thereof

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104504038A (en) * 2014-12-15 2015-04-08 北京更快互联网技术有限公司 Hash search method for reducing hash collision
CN104965687B (en) * 2015-06-04 2017-12-08 北京东方国信科技股份有限公司 Big data processing method and processing device based on instruction set generation
US10169011B2 (en) * 2016-10-24 2019-01-01 International Business Machines Corporation Comparisons in function pointer localization
CN107291858B (en) * 2017-06-09 2021-06-08 成都索贝数码科技股份有限公司 Data indexing method based on character string suffix

Also Published As

Publication number Publication date
CN109165220A (en) 2019-01-08

Similar Documents

Publication Publication Date Title
CN108255958B (en) Data query method, device and storage medium
US10938961B1 (en) Systems and methods for data deduplication by generating similarity metrics using sketch computation
US9104676B2 (en) Hash algorithm-based data storage method and system
US7885967B2 (en) Management of large dynamic tables
US10140351B2 (en) Method and apparatus for processing database data in distributed database system
EP3292481B1 (en) Method, system and computer program product for performing numeric searches
CN107704202B (en) Method and device for quickly reading and writing data
EP2924594A1 (en) Data encoding and corresponding data structure in a column-store database
CN111221840B (en) Data processing method and device, data caching method, storage medium and system
US11995050B2 (en) Systems and methods for sketch computation
CN111045988B (en) File searching method, device and computer program product
CN109460406B (en) Data processing method and device
CN115964002B (en) Electric energy meter terminal archive management method, device, equipment and medium
CN111209341B (en) Data storage method, device, equipment and medium of block chain
WO2021127245A1 (en) Systems and methods for sketch computation
US20210191640A1 (en) Systems and methods for data segment processing
CN111651695A (en) Method and device for generating and analyzing short link
CN112380004B (en) Memory management method, memory management device, computer readable storage medium and electronic equipment
CN109165220B (en) Data matching calculation method
CN116842012A (en) Method, device, equipment and storage medium for storing Redis cluster in fragments
CN111414527A (en) Similar item query method and device and storage medium
CN116089527A (en) Data verification method, storage medium and device
CN113285933A (en) User access control method and device, electronic equipment and storage medium
CN108984780B (en) Method and device for managing disk data based on data structure supporting repeated key value tree
CN109947775B (en) Data processing method and device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant