CN101398820B - Large scale key word matching method - Google Patents

Large scale key word matching method

Info

Publication number
CN101398820B
Authority
CN
China
Prior art keywords
keyword
string
text
hash
bloom filters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN200710122231XA
Other languages
Chinese (zh)
Other versions
CN101398820A (en
Inventor
叶润国
周涛
华东明
孙海波
骆拥政
焦玉峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Venus Information Technology Co Ltd
Original Assignee
Beijing Venus Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Venus Information Technology Co Ltd
Priority to CN200710122231XA
Publication of CN101398820A
Application granted
Publication of CN101398820B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a matching method for large-scale keyword sets, comprising a preprocessing stage and a pattern matching stage. The preprocessing stage comprises cutting a feature string out of each keyword, constructing a plurality of simple Bloom filters based on the keyword feature string set, and constructing a hash table based on the keyword feature string set. The pattern matching stage comprises the following steps: the previously constructed series of simple Bloom filters is used to determine quickly that the text string in the current window does not match any keyword feature string; when this exclusion test fails, exact matching against the candidate keywords is performed; during text scanning, the current hash value of the current text for each simple Bloom filter is computed quickly by a recursive algorithm. The method takes full advantage of two characteristics: the probability that the text to be matched actually matches a keyword is extremely low, and recursive hash computation is highly efficient. It achieves high-speed matching with large-scale keyword sets and is well suited to online scanning applications such as virus detection.

Description

A large-scale keyword matching method
Technical field
The present invention relates to the field of computer content analysis, and in particular to a multi-keyword matching method for fast content analysis.
Background art
The multi-keyword matching problem (Multiple Pattern String Matching) is to determine quickly whether a data block contains one or more keywords from a keyword set. Multi-keyword matching is widely used in text processing, network content analysis, intrusion detection, information retrieval and virus detection.
Traditional multi-keyword matching methods include [A. V. Aho, M. J. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 1975, 18(6): 333-340], [S. Wu, U. Manber. A Fast Algorithm For Multi-Pattern Searching. Technical Report TR-94-17, University of Arizona, 1994: 1-11] and [K. G. Anagnostakis, S. Antonatos, M. Polychronakis, E. P. Markatos. A domain-specific string matching algorithm for intrusion detection. In Proceedings of IFIP International Information Security Conference (SEC'03), May 2003]. Each of these methods has its own ideal application conditions: Aho-Corasick works best with small keyword sets, Wu-Manber with medium-sized keyword sets, and E2XB in intrusion detection. None of them performs satisfactorily with large-scale keyword sets, and they are not suited to real-time virus detection. Multi-keyword matching in real-time virus detection has the following characteristics: 1) the number of keywords is very large, typically about 60,000 to 200,000; 2) keywords are generally long, with a minimum length of 8 bytes; 3) the text to be scanned is large, ranging from several kilobytes to several megabytes; 4) the probability that the text to be scanned matches any keyword is extremely low.
[O. Erdogan, Pei Cao. Hash-AV: fast virus signature scanning by cache-resident filters. Global Telecommunications Conference, 2005 (GLOBECOM '05), IEEE, Volume 3, 28 Nov.-2 Dec. 2005, 6 pp.] describes HASH-AV, a multi-keyword matching method designed around the characteristics of virus detection. It builds a Bloom filter that fits in a modern CPU cache and uses a carefully designed group of Bloom filter hash functions; by calling these hash functions in turn it determines quickly that the text string in the current window does not match any keyword. Because the probability that a text stream matches any keyword is extremely low in applications such as virus scanning, this Bloom-filter-based exclusion test succeeds in the vast majority of cases, so the expensive full keyword comparison is rarely needed. Compared with other keyword matching methods, HASH-AV takes the specific characteristics of virus detection into account and achieves good scanning speed there. When a Bloom filter is used to judge whether an element belongs to a given set, it never produces false negatives but may produce false positives, and the false positive rate is high when the represented set is very large. In theory, false positives can be reduced by enlarging the bit string of the Bloom filter, but in practice this has limited effect because the hash functions constructed in practice are not sufficiently random. HASH-AV uses a single Bloom filter to represent the keyword set to be searched. In our experiments we found that when the keyword set exceeds 100,000 entries, the false positive rate of the single-filter exclusion test is high, which directly reduces the matching efficiency of HASH-AV. In addition, each time the text matching window moves, HASH-AV recomputes every Bloom filter hash function from scratch on the current text, ignoring the fact that the current text string is largely identical to the text string in the previous window.
Summary of the invention
The purpose of the present invention is to overcome the above shortcomings of the prior art and to provide a large-scale keyword matching method suited to real-time virus detection. It uses a plurality of simple Bloom filters to determine quickly that the text in the current window does not match any keyword, and uses recursive hash functions to improve the lookup efficiency of each Bloom filter.
The purpose of the invention is achieved through the following technical solution:
A large-scale keyword matching method comprising a preprocessing stage and a pattern matching stage, wherein
A) the preprocessing stage comprises the following steps:
A1: according to the configured keyword feature string length, extract a feature string from each keyword in the keyword set to form the keyword feature string set;
A2: construct a plurality of simple Bloom filters, each containing a single hash function that supports recursive computation, and map the keyword feature string set obtained in step A1 into all of these simple Bloom filters;
A3: construct a hash table and map the keyword feature string set into its cells; elements whose hash values collide are chained together in a linked list;
A4: build a linear list containing all original keywords, and store in each entry of the keyword feature string hash table built in step A3 the index number of the corresponding original keyword;
B) the pattern matching stage comprises the following steps:
B1: set up a text matching window whose length equals the keyword feature string length, and first align the window with the left end of the text to be matched;
B2: taking the text string in the current text matching window as input, apply each simple Bloom filter and its hash function in turn to determine quickly that the current text string does not match any keyword feature string: if some simple Bloom filter excludes the current text string, jump directly to step B5; if the exclusion test based on the current simple Bloom filter fails, continue with the next simple Bloom filter; if the exclusion test fails for all simple Bloom filters, go to step B3;
B3: look up the text string in the text matching window in the keyword feature string hash table; if a matching keyword feature string entry is found, go to step B4; if no matching entry is found, jump directly to step B5;
B4: using the index number stored in the keyword feature string entry, read the corresponding original keyword from the linear list of original keywords and compare it in full with the text at the current matching window position; if the match succeeds, report a successful keyword match event; whether or not the match succeeds, continue with step B5;
B5: move the current text matching window 1 byte to the right and jump back to step B2, until the whole text has been scanned.
Preferably, in step A1 of the preprocessing stage A, the keyword feature string extracted for each keyword in the original keyword set is the keyword substring that occurs least often in the whole keyword set.
Preferably, each simple Bloom filter constructed in step A2 of the preprocessing stage A contains only one hash function, and this hash function is constructed from the Rabin fingerprint polynomial modulus algorithm, which supports recursive computation.
Preferably, when the keyword feature string hash table is constructed in step A3 of the preprocessing stage A, the hash function chosen for the hash table is the hash function of the last of the simple Bloom filters constructed in step A2.
As the above technical solution shows, after extracting a feature string from each keyword in the keyword set, the present invention builds a plurality of simple Bloom filters over the keyword feature string set in order to determine quickly, during text scanning, that the current text does not match any keyword feature string. This effectively mitigates the high false positive rate caused by a single Bloom filter. The method also takes into account that, during text matching, the current text string is largely identical to the text string in the previous window, and uses recursive hash functions to improve the lookup efficiency of each Bloom filter, thereby greatly improving the efficiency of large-scale keyword matching. Under the same test conditions, the method is faster than HASH-AV for keyword sets of every size tested; in particular, for large sets of long keywords, its text scanning speed is 0.7 times faster than that of HASH-AV. The method is well suited to applications such as real-time online intrusion detection and virus detection.
Description of drawings
Fig. 1 shows the process of inserting an element into a Bloom filter data structure;
Fig. 2 shows the process of judging whether an element belongs to the set represented by a Bloom filter;
Fig. 3 is the workflow diagram of the preprocessing stage of the method;
Fig. 4 is the flow chart of text scanning in the pattern matching stage of the method.
Detailed description of the embodiments
So that those of ordinary skill in the art can fully understand the present invention, an important data structure used in the invention, the Bloom filter, is introduced first.
A Bloom filter is a compact data structure that represents all elements of a set and supports membership lookup, that is, it can answer the question "does a given element belong to a given set?".
A Bloom filter uses a bit string (bit vector) $V$ of length $m$ to represent a set of data elements $A = \{a_1, a_2, \ldots, a_n\}$. Let there be $k$ uniformly distributed hash functions $\{h_i\}$, $i = 1, \ldots, k$, satisfying the condition $\forall x \in A,\ h_i(x) \in \{1, 2, \ldots, m\}$. Set representation and element lookup based on a Bloom filter then work as follows:
Set representation: for any element $a_i$ in the set, apply the $k$ predefined hash functions to $a_i$ in turn to obtain $k$ hash values $\{b_1, b_2, \ldots, b_k\}$, $b_i \in [1..m]$, and then set bits $b_1, b_2, \ldots, b_k$ of the bit string $V$ to 1. Fig. 1 illustrates inserting an element into a Bloom filter ($k = 3$).
Element lookup: to judge whether an element $a$ belongs to the set represented by the Bloom filter, apply the $k$ predefined hash functions to $a$ in turn to obtain $k$ hash values $\{b_1, b_2, \ldots, b_k\}$, $b_i \in [1..m]$, and then check whether bits $b_1, b_2, \ldots, b_k$ of the bit string $V$ are all 1. If they are all 1, the element is reported as being in the set; otherwise it is not in the set. Fig. 2 illustrates element lookup (that is, the membership decision) based on a Bloom filter.
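The two procedures above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the class name, the use of salted SHA-256 as the hash family, and the 0-based bit positions (0..m-1 rather than 1..m) are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Classic Bloom filter: a bit vector V of length m and k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _positions(self, item: bytes):
        # Hypothetical hash family: SHA-256 salted with the function index i.
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: bytes) -> None:
        # Set representation: set the k hashed positions to 1.
        for b in self._positions(item):
            self.bits[b] = 1

    def might_contain(self, item: bytes) -> bool:
        # Lookup: "in the set" only if all k positions are 1.
        # A True answer may be a false positive; False is always correct.
        return all(self.bits[b] for b in self._positions(item))
```

An element that was added is always reported as present; an element that was never added is usually, but not always, reported as absent.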
False positives may occur when set membership is tested with a Bloom filter, but the false positive rate can be kept within an acceptable range by controlling the length $m$ of the bit string $V$.
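For orientation, the dependence of the false positive rate on $m$ can be written with the standard textbook approximation below. It is not taken from the patent and assumes ideal, independent, uniformly distributed hash functions; $n$ is the number of stored elements and $k$ the number of hash functions.

```latex
P_{\mathrm{fp}} \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad\text{and for } k = 1:\quad
P_{\mathrm{fp}} \approx 1 - e^{-n/m}
```

The $k = 1$ case corresponds to a filter with a single hash function, where the rate falls as $m$ grows relative to $n$.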
The method of the invention comprises a preprocessing stage and a pattern matching stage. The concrete implementation steps of each stage are described in detail below with reference to the drawings.
In the preprocessing stage, the method preprocesses the original keyword set and generates several key data structures that support text scanning in the pattern matching stage.
As shown in Fig. 3, the preprocessing stage is implemented as follows:
Step 301: according to the configured keyword feature string length, extract a feature string from each keyword in the keyword set;
Step 302: construct a group of simple Bloom filters, each containing a single hash function that can be computed recursively, and map all extracted keyword feature strings into all of the Bloom filters;
Step 303: construct a hash table and map all extracted keyword feature strings into its cells; elements whose hash values collide are chained together in a linked list;
Step 304: build a linear list containing all original keywords, and store in each entry of the keyword feature string hash table built in step 303 the index number of the corresponding original keyword.
Embodiment 1:
Suppose there are K original keywords to search for, denoted $P = \{P_1, P_2, \ldots, P_K\}$. In practice the original keywords differ in length. To allow many keywords to be matched in parallel, the invention cuts all keywords to equal length: a keyword substring length $W$ is chosen, and each original keyword $P_i$ in the set $P$ is cut down to a keyword substring $M_i$ of $W$ bytes. This $W$-byte substring $M_i$ is called the keyword feature string of the original keyword. The set formed by all extracted feature strings $M_i$ is the keyword feature string set $M$. Note that $W$ must not exceed the length of the shortest keyword in the original keyword set. The simplest cutting method is to take the $W$-byte prefix or suffix of each keyword as its feature string.
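The simplest cutting method (prefix or suffix) might look like the sketch below. The function name and the `use_suffix` flag are hypothetical; the length check reflects the constraint that $W$ must not exceed the shortest keyword.

```python
def cut_feature_strings(keywords, w, use_suffix=False):
    """Cut every keyword to a W-byte feature string (prefix by default)."""
    shortest = min(len(k) for k in keywords)
    if w > shortest:
        raise ValueError(f"W={w} exceeds the shortest keyword length {shortest}")
    # Take the last W bytes for the suffix variant, the first W bytes otherwise.
    return [k[-w:] if use_suffix else k[:w] for k in keywords]
```

For example, with the keywords `abcdefghijk`, `abcopqrst` and `wyzopqhijk` and W = 6, the prefix variant yields `abcdef`, `abcopq` and `wyzopq`.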
After the keyword feature string set $M$ has been built, K simple Bloom filters are constructed, and $M$ is mapped into these K simple Bloom filters.
Each simple Bloom filter representing the keyword feature string set $M$ is constructed with the known Bloom filter construction method described above:
A) set up a bit string $V$ of length $v$ and select a single hash function $H_1$ for this Bloom filter. $H_1$ must support recursive computation, that is, the hash value $H_1(t_i t_{i+1} \ldots t_{i+m-1})$ of the text string in the current window can be obtained by modifying the hash value $H_1(t_{i-1} t_i \ldots t_{i+m-2})$ of the text string in the previous window;
B) for each element $M_i$ of the keyword feature string set $M$, call the hash function $H_1$ of the simple Bloom filter and set to 1 the bit of the bit string $V$ at the position given by the hash value.
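Steps A) and B) can be sketched as follows. The class name is hypothetical, and the additive byte-sum hash is chosen here only because it is the simplest recursively computable hash; any hash function with that property could be passed in instead.

```python
class SimpleBloomFilter:
    """Bloom filter with a bit string of length v and exactly one hash function."""

    def __init__(self, v: int, hash_fn):
        self.v = v
        self.hash_fn = hash_fn
        self.bits = bytearray(v)  # one byte per bit, for readability

    def add(self, feature: bytes) -> None:
        # Step B): set the bit at the hashed position to 1.
        self.bits[self.hash_fn(feature) % self.v] = 1

    def may_match(self, hash_value: int) -> bool:
        # Takes a precomputed hash so the caller can supply a rolled value.
        return self.bits[hash_value % self.v] == 1


def additive_hash(s: bytes) -> int:
    """Illustrative recursively computable hash: the sum of the bytes."""
    return sum(s)
```

`may_match` accepts a hash value rather than the text itself, so that during scanning the value can come from the previous window's hash instead of being recomputed.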
When the method is implemented, the hash function selected for each simple Bloom filter must support recursive computation. Hash functions that support recursive computation and can be used in the invention include addition, logical XOR and the Rabin fingerprint polynomial modulus algorithm. Note that the hash functions selected for the individual simple Bloom filters must differ from one another; otherwise the filters cannot make independent exclusion decisions about whether the current text string matches any keyword feature string.
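For the two simplest families named above, the recursive update is a one-line correction of the previous window's value. The sketch below uses hypothetical function names; `out_byte` is the byte leaving the window on the left and `in_byte` the byte entering on the right.

```python
def roll_add(prev: int, out_byte: int, in_byte: int) -> int:
    """Additive hash of the next window, derived from the previous one."""
    return prev - out_byte + in_byte


def roll_xor(prev: int, out_byte: int, in_byte: int) -> int:
    """XOR hash of the next window: XOR-ing a byte twice cancels it."""
    return prev ^ out_byte ^ in_byte
```

Each update costs a constant number of operations regardless of the window width W, which is the property the method relies on.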
To match the text string in the text window exactly against the candidate subset of keyword feature strings, the invention constructs a hash table for the keyword feature string set $M$. So that the hash table is well balanced, it is advisable to choose for it a hash function $H$ with good uniformity; $H$ should also be cheap to compute, to reduce the lookup cost of the hash table. Keyword feature strings in $M$ whose hash values collide are chained together in a linked list, and the elements of each chain are arranged in ascending lexicographic order.
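A chained hash table of this kind might be sketched as below. Python lists kept sorted with `bisect.insort` stand in for the linked lists, the function names are hypothetical, and the lookup returns a list of keyword indices on the assumption (not stated in the source) that several keywords could share one feature string.

```python
import bisect


def build_feature_table(features, size, hash_fn):
    """Map each feature string to a bucket; colliding entries share a sorted chain."""
    table = [[] for _ in range(size)]
    for idx, feat in enumerate(features):
        # Each entry stores the feature string and the index of its keyword.
        bisect.insort(table[hash_fn(feat) % size], (feat, idx))
    return table


def lookup(table, hash_fn, window):
    """Return the keyword indices whose feature string equals the window text."""
    chain = table[hash_fn(window) % len(table)]
    return [idx for feat, idx in chain if feat == window]
```

With `hash_fn=sum` and the feature strings `bcdefg`, `copqrs`, `pqhijk`, looking up `copqrs` returns `[1]`, and a string that is not a feature string returns an empty list.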
The invention also builds a linear list $L$ containing every keyword of the original keyword set $P$, so that after the text string in the text matching window has matched a keyword feature string, the original keyword corresponding to that feature string can be read and matched exactly over its full length to obtain the final keyword matching result. So that the relevant original keyword can be found quickly after such a match, each entry of the keyword feature string hash table built earlier stores the index number of the corresponding original keyword.
In the pattern matching stage, the invention scans the text to be examined quickly using the data structures built in the preprocessing stage. So that those skilled in the art can understand the invention better, the pattern matching process is described in detail below with reference to Fig. 4. It comprises the following steps:
Step 401: set the text matching window width to W bytes, where W is the keyword feature string length, and align the text matching window with the left end of the text to be matched.
Step 402: use each simple Bloom filter in turn to determine quickly that the current text string does not match any keyword feature string. Taking as input the text string in the current text window (W bytes long) and the previous hash value, use the recursive property of the filter's hash function to obtain the current hash value $H_1$ by modification, and check the corresponding bit in the bit string of this simple Bloom filter: if the bit is 0, jump directly to step 405; if the bit is 1, use the next simple Bloom filter to perform the same exclusion test. If none of the simple Bloom filters can exclude the current text string, the text string may match some keyword feature string, and execution continues with step 403.
Step 403: apply the hash function $H$ of the keyword feature string hash table to the text string in the current text window, locate a chain of keyword feature strings according to the hash value, and compare the text string in the current window exactly with each keyword feature string in the chain: if it matches some keyword feature string, continue with step 404; if it matches none of the keyword feature strings in the chain, jump directly to step 405.
Step 404: take the original keyword index number from the keyword feature string entry that was found, load the corresponding original keyword from the linear list of original keywords, and compare it character by character over its full length with the text at the current text matching window position; if the match succeeds, report a successful original keyword match event. Whether or not the match succeeds, continue with step 405.
Step 405: move the current text matching window 1 byte to the right and jump back to step 402, until the whole text to be examined has been scanned.
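Steps 401 to 405 can be put together in a compact sketch. Several simplifications here are assumptions of the example rather than part of the source: feature strings are W-byte prefixes, a Python dict stands in for the chained hash table, every filter's hash is rolled on every shift (rather than only when that filter is consulted), and the hash is a polynomial hash with per-filter parameters `(p, mod)`.

```python
def poly_hash(s: bytes, p: int, mod: int) -> int:
    """Polynomial hash of s, computed directly by Horner's rule."""
    h = 0
    for byte in s:
        h = (h * p + byte) % mod
    return h


def preprocess(keywords, w, params):
    """params: one (p, mod) pair per simple Bloom filter."""
    features = [k[:w] for k in keywords]  # prefix cut, the simplest option
    filters = []
    for p, mod in params:
        bits = bytearray(mod)
        for f in features:
            bits[poly_hash(f, p, mod)] = 1
        filters.append(bits)
    table = {}  # feature string -> indices into the keyword list
    for idx, f in enumerate(features):
        table.setdefault(f, []).append(idx)
    return filters, table


def scan(text, keywords, w, params):
    filters, table = preprocess(keywords, w, params)
    top = [pow(p, w - 1, mod) for p, mod in params]  # p^(w-1) per filter
    hashes, matches = None, []
    for pos in range(len(text) - w + 1):  # steps 401 and 405
        window = text[pos:pos + w]
        if pos == 0:
            hashes = [poly_hash(window, p, mod) for p, mod in params]
        else:  # recursive update from the previous window
            out_b, in_b = text[pos - 1], text[pos + w - 1]
            hashes = [(p * (h - out_b * t) + in_b) % mod
                      for h, (p, mod), t in zip(hashes, params, top)]
        # Step 402: any filter with a 0 bit excludes the window.
        if any(bits[h] == 0 for bits, h in zip(filters, hashes)):
            continue
        # Steps 403 and 404: exact feature match, then full keyword compare.
        for idx in table.get(window, ()):
            if text.startswith(keywords[idx], pos):
                matches.append((pos, keywords[idx]))
    return matches
```

Because the feature string is taken as a prefix, a keyword that matches starts at the window position itself; with a feature string cut from the middle of a keyword, the comparison in step 404 would have to start earlier by the feature string's offset.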
When the invention is implemented, step A1 of the preprocessing stage A may use the following preferred implementation: for each keyword $P_i$ in the keyword set $P = \{P_1, P_2, \ldots, P_K\}$, the extracted keyword feature string $M_i$ is the keyword substring that occurs least often in the whole keyword set.
The following method can be used to make the extracted keyword feature string $M_i$ the least frequent keyword substring in the whole keyword set:
A) set up a hash table to hold all possible keyword substrings of length W;
B) any original keyword $P_i$ of length $n_i$ can be split into $(n_i - W + 1)$ keyword substrings of length W. For each such substring, first check whether it is already in the keyword substring hash table: if it is not, create a new keyword substring entry and set its counter to 1; if the entry is already in the hash table, add 1 to its counter;
C) once all keywords of the original keyword set have been processed as in step B), select for each original keyword the keyword substring that occurs least often. Specifically, for any original keyword $P_i$ of length $n_i$, look up the counts of its $(n_i - W + 1)$ substrings of W bytes in the keyword substring hash table, and select the one with the smallest count as the keyword feature string of $P_i$.
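Steps A) to C) amount to a counting pass followed by a selection pass. In the sketch below, `collections.Counter` plays the role of the substring hash table, the function name is hypothetical, and ties are broken by taking the leftmost candidate (an assumption of the example; any of the tied substrings would do).

```python
from collections import Counter


def select_feature_strings(keywords, w):
    """Pick, for each keyword, its least frequent substring of length w."""
    counts = Counter()
    # Steps A) and B): count every length-w substring of every keyword.
    for kw in keywords:
        for i in range(len(kw) - w + 1):
            counts[kw[i:i + w]] += 1
    # Step C): choose the candidate with the smallest count.
    features = []
    for kw in keywords:
        candidates = [kw[i:i + w] for i in range(len(kw) - w + 1)]
        features.append(min(candidates, key=counts.__getitem__))
    return features
```

For the keywords `abcdXY` and `abcdZW` with w = 4, the shared substring `abcd` has count 2, so the function selects `bcdX` and `bcdZ` instead.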
In a preferred implementation, when the simple Bloom filters representing the keyword feature string set are constructed in step A2 of the preprocessing stage, the hash function selected for each simple Bloom filter is the Rabin fingerprint polynomial modulus algorithm, which can compute the hash value of the text string in the current text window recursively from the hash value of the text string in the previous text window.
So that those skilled in the art can construct the recursive hash function used in the method, the Rabin fingerprint polynomial modulus algorithm is introduced here.
Consider a text string of length $n$, denoted $t_1, t_2, \ldots, t_n$. For the text string $t_1, t_2, \ldots, t_w$ in a text window of $w$ bytes, the Rabin fingerprint $F_1$ is computed as:
$F_1 = (t_1 p^{w-1} + t_2 p^{w-2} + \cdots + t_w) \bmod M$, where $p$ and $M$ are constants.
To compute the Rabin fingerprint $F_2$ of the text string $t_2, t_3, \ldots, t_{w+1}$ in the next text window, it suffices to start from the previous fingerprint $F_1$, remove the first polynomial coefficient $t_1$ and add the last polynomial coefficient $t_{w+1}$, that is, $F_2 = (p\,(F_1 - t_1 p^{w-1}) + t_{w+1}) \bmod M$. To speed up the computation of the Rabin hash values $F_1$ and $F_2$, a two-dimensional table as shown in Table 1 can be precomputed that contains every $(t_i p^{w-1})$ value, where $t_i \in [0, 255]$ and the values of $p$ and $w$ are fixed in advance.
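The relation between the direct fingerprint and its recursive update can be checked with a short sketch. The function names are hypothetical; the update is written in the form $p\,(F_1 - t_1 p^{w-1}) + t_{w+1}$, which removes the leading term before the shift so that every remaining term gains exactly one power of $p$.

```python
def rabin_fingerprint(window: bytes, p: int, M: int) -> int:
    """F = (t_1*p^(w-1) + t_2*p^(w-2) + ... + t_w) mod M, by Horner's rule."""
    f = 0
    for t in window:
        f = (f * p + t) % M
    return f


def roll(f_prev: int, t_out: int, t_in: int, p: int, M: int, w: int) -> int:
    """Fingerprint of the next window from the previous one in O(1)."""
    # Python's % always returns a non-negative result, so the
    # subtraction cannot produce a negative fingerprint.
    return (p * (f_prev - t_out * pow(p, w - 1, M)) + t_in) % M
```

Sliding over a text and comparing `roll(...)` with `rabin_fingerprint(...)` of each window gives the same value at every position, for any choice of `p` and `M`.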
Table 1: Two-dimensional table (256*W cells) used to speed up Rabin hash computation

t_i value | p^0 | p^1 | p^2 | ... | p^(w-2) | p^(w-1)
0         |     |     |     |     |         |
1         |     |     |     |     |         |
2         |     |     |     |     |         |
...       |     |     |     |     |         |
254       |     |     |     |     |         |
255       |     |     |     |     |         |
Once the above Rabin hash lookup table has been built, the $(t_i p^{w-1})$ values needed when computing $F_1$ and $F_2$ are obtained directly by table lookup. Note that when a recursive hash function based on the Rabin fingerprint polynomial modulus is constructed for each simple Bloom filter, a different $p$ value must be selected for each simple Bloom filter.
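A lookup table with the layout of Table 1 might be built and used as follows. This is a sketch under assumptions: the function names are hypothetical, and the entries are stored already reduced modulo $M$, which the source does not specify.

```python
def build_table(p: int, M: int, w: int):
    """256 x w table: entry [t][j] holds (t * p^j) mod M."""
    powers = [pow(p, j, M) for j in range(w)]
    return [[(t * pj) % M for pj in powers] for t in range(256)]


def fingerprint_with_table(window: bytes, table, M: int) -> int:
    """Direct fingerprint: the byte at offset i is weighted by p^(w-1-i)."""
    w = len(window)
    return sum(table[t][w - 1 - i] for i, t in enumerate(window)) % M


def roll_with_table(f_prev: int, t_out: int, t_in: int, table, p: int, M: int) -> int:
    """Recursive update using the precomputed t_out * p^(w-1) entry."""
    w = len(table[0])
    return (p * (f_prev - table[t_out][w - 1]) + t_in) % M
```

The direct fingerprint then needs only additions, and each recursive update needs one table lookup instead of computing a power of $p$.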
In a preferred implementation, when the keyword feature string hash table is constructed in step A3, the hash function defined for the last of the simple Bloom filters constructed in step A2 can be used directly in order to save hash table lookup time. This saves the time needed to compute the hash function $H$ and speeds up hash table lookup. Moreover, because the hash function of the last simple Bloom filter has the best uniformity, a keyword feature string hash table built on it will be better balanced.
Embodiment 2:
The overall technical solution of the invention is described further below by means of an example.
Suppose the keyword set is P={abcdefghijk, abcopqrst, wyzopqhijk} and the text to be matched is bcgilmnomlmloptrstuvabc.
Preprocessing according to the method proceeds as follows:
First, the keyword feature string length is determined and the feature string of each keyword is cut out. Here a feature string length of 6 bytes is selected, and the feature string of each keyword is chosen according to the least-frequent-substring principle. The resulting keyword feature string set is M={bcdefg, copqrs, pqhijk} (note that several keyword substrings may satisfy the least-frequent principle; in practice one of them can be selected at random).
Next, 3 simple Bloom filters are constructed from the keyword feature string set M. The hash functions of these three simple Bloom filters are all constructed with the Rabin fingerprint polynomial modulus method: for the first simple Bloom filter, $p$ is 5, $M$ is 128 and the bit string length is 128 bits; for the second, $p$ is 7, $M$ is 128 and the bit string length is 128 bits; for the third, $p$ is 11, $M$ is 128 and the bit string length is 128 bits. The recursive formula $F_2 = (p\,(F_1 - t_1 p^{w-1}) + t_{w+1}) \bmod M$ is prepared for each of the three simple Bloom filters, with $w$ equal to 6, $p$ equal to 5, 7 and 11 respectively, and $M$ equal to 128. To speed up the recursive hash computation, a lookup table as shown in Table 1 is built for the recursive hash function of each of the three simple Bloom filters; because $w$ is 6, each lookup table has 256*6 cells.
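The three filters of this example can be built with the parameters given above. The sketch assumes ASCII byte values for the characters, which the source does not state, so the bit values quoted in the walkthrough of this example are not necessarily reproduced by it.

```python
def rabin(s: bytes, p: int, M: int) -> int:
    """Rabin-style polynomial fingerprint by Horner's rule."""
    f = 0
    for t in s:
        f = (f * p + t) % M
    return f


FEATURES = [b"bcdefg", b"copqrs", b"pqhijk"]  # feature string set M
PARAMS = [(5, 128), (7, 128), (11, 128)]      # (p, M) for filters 1, 2, 3

filters = []
for p, M in PARAMS:
    bits = bytearray(M)  # 128-bit bit string, one byte per bit
    for feature in FEATURES:
        bits[rabin(feature, p, M)] = 1
    filters.append(bits)


def passes_all(window: bytes) -> bool:
    """True if no filter excludes the window (a candidate for exact matching)."""
    return all(bits[rabin(window, p, M)]
               for bits, (p, M) in zip(filters, PARAMS))
```

Every feature string passes all three filters by construction; other 6-byte strings are excluded as soon as one filter has a 0 bit at their hash position.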
Then the keyword feature string hash table is constructed using the hash function $H_1$ of the third simple Bloom filter. This hash table contains all elements of the keyword feature string set, that is, M={bcdefg, copqrs, pqhijk}.
Finally, a linear list is constructed for the original keyword set P={abcdefghijk, abcopqrst, wyzopqhijk}, and the index number of each original keyword is stored in the corresponding entry of the keyword feature string hash table: the entry bcdefg stores index number 0 of the original keyword abcdefghijk, the entry copqrs stores index number 1 of the original keyword abcopqrst, and the entry pqhijk stores index number 2 of the original keyword wyzopqhijk.
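The link between feature string entries and the linear list can be sketched as below. A dict stands in for the hash table, and the offset arithmetic is an assumption of the example: since these feature strings are cut from inside their keywords, the full comparison has to start before the window position by the feature string's offset within the keyword, a detail the source does not spell out.

```python
KEYWORDS = [b"abcdefghijk", b"abcopqrst", b"wyzopqhijk"]  # linear list L
FEATURES = [b"bcdefg", b"copqrs", b"pqhijk"]

# Feature string entry -> index number of its original keyword.
FEATURE_INDEX = {f: i for i, f in enumerate(FEATURES)}


def full_match(text: bytes, pos: int, feature: bytes) -> bool:
    """Full-length comparison after `feature` matched the window at `pos`."""
    keyword = KEYWORDS[FEATURE_INDEX[feature]]
    # Where the keyword would have to start for the feature to sit at pos.
    start = pos - keyword.find(feature)
    return start >= 0 and text.startswith(keyword, start)
```

For the text `--abcopqrst--`, the feature string `copqrs` matches at position 4, the keyword would start at position 2, and the full comparison succeeds.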
Pattern matching according to the method proceeds as follows:
First, the text matching window width is set to 6 bytes, and the text matching window is aligned with the left end of the text to be matched, bcgilmnomlmloptrstuvabc.
For the text string bcgilm in the current window, the first simple Bloom filter is used to make the match-exclusion judgment. The hash value of bcgilm is computed from the hash formula $F_1 = (t_1 p^{w-1} + t_2 p^{w-2} + \dots + t_w) \bmod M$ and the Rabin lookup table, and the corresponding bit of the filter's bit string is checked. The bit is 0, which rules out any match between the current text string and a keyword feature string, so the other two simple Bloom filters need not be consulted. The text matching window moves one byte to the right, and the current text string becomes cgilmn.
The first simple Bloom filter is again applied to cgilmn, this time computing its hash value with the recursive formula $F_2 = (p\,(F_1 - t_1 p^{w-1}) + t_{w+1}) \bmod M$. The corresponding bit of the first filter's bit string is 1, so the other Bloom filters must be used for further exclusion judgments. Because this is the first call to the hash function of the second simple Bloom filter, the hash value of cgilmn is computed with the standard hash formula $F_1 = (t_1 p^{w-1} + t_2 p^{w-2} + \dots + t_w) \bmod M$ and the Rabin lookup table. The corresponding bit in the second filter's bit string is 1, meaning that the second filter fails to exclude the text string, so the third simple Bloom filter is needed. This is also the first call to the hash function of the third simple Bloom filter, so the hash value of cgilmn is again computed with the standard hash formula and the Rabin lookup table. The corresponding bit in the third filter's bit string is 0, meaning that the third filter succeeds in excluding the text string.
The text matching window moves one byte to the right, and the current text string becomes gilmno. The third round of exclusion judgments with the three simple Bloom filters then begins, and the process repeats until the whole text has been scanned.
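The scan walked through above can be sketched end to end as follows. This is an illustrative reading rather than the patented implementation, and two details are assumptions the text does not spell out: a filter's hash is updated recursively only when that filter was also evaluated on the immediately preceding window (otherwise the standard formula is used), and the full-length comparison aligns each keyword with the text by the offset of its feature string inside the keyword. The function names are hypothetical.

```python
W, M, PS = 6, 128, (5, 7, 11)
KEYWORDS = [b"abcdefghijk", b"abcopqrst", b"wyzopqhijk"]
FEATURES = [b"bcdefg", b"copqrs", b"pqhijk"]  # FEATURES[i] belongs to KEYWORDS[i]


def rabin_hash(s, p):
    """(t_1 p^(w-1) + ... + t_w) mod M, by Horner's rule."""
    h = 0
    for t in s:
        h = (h * p + t) % M
    return h


def preprocess():
    filters = [[0] * M for _ in PS]       # one bit string per filter
    buckets = [[] for _ in range(M)]      # chained table, third filter's hash
    for idx, feat in enumerate(FEATURES):
        for bits, p in zip(filters, PS):
            bits[rabin_hash(feat, p)] = 1
        buckets[rabin_hash(feat, PS[-1])].append((feat, idx))
    return filters, buckets


def scan(text):
    filters, buckets = preprocess()
    top = [pow(p, W - 1, M) for p in PS]  # p^(w-1) mod M for each filter
    last = [None] * len(PS)               # (window start, hash) per filter
    matches = []
    for i in range(len(text) - W + 1):
        window = text[i:i + W]
        excluded = False
        for k, p in enumerate(PS):
            if last[k] is not None and last[k][0] == i - 1:
                # Recursive update from the previous window's hash.
                h = (p * (last[k][1] - text[i - 1] * top[k]) + text[i + W - 1]) % M
            else:
                h = rabin_hash(window, p)  # standard formula
            last[k] = (i, h)
            if not filters[k][h]:          # bit is 0: window excluded
                excluded = True
                break
        if excluded:
            continue
        # All filters failed to exclude; h is now the third filter's hash.
        for feat, idx in buckets[h]:
            if feat == window:
                kw = KEYWORDS[idx]
                start = i - kw.find(feat)  # assumed alignment by feature offset
                if start >= 0 and text[start:start + len(kw)] == kw:
                    matches.append((start, kw))
    return matches
```

Calling `scan(b"bcgilmnomlmloptrstuvabc")` on the example text reports no matches, since none of the three keywords occurs in it; every reported match has passed the exact full-length comparison, so Bloom filter false positives never reach the output.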
When simple Bloom filters are used to judge quickly that the current text string matches no keyword feature string, the method takes full advantage of the Rabin hash lookup table and of the fact that the hash value of the current text string can be obtained by a simple update of the previous text string's hash value. The cost of computing each simple Bloom filter's hash value is therefore very small, which greatly improves the overall performance of the large-scale keyword matching method of the invention.
Although the invention has been described by way of an embodiment, those of ordinary skill in the art will recognize that it admits many modifications and variations without departing from its spirit; the appended claims are intended to cover such modifications and variations.

Claims (4)

1. A large-scale keyword matching method comprising a preprocessing stage and a pattern matching stage, characterized by comprising the following steps:
A) the preprocessing stage comprises the following steps:
A1, according to a set keyword feature string length, extracting a feature string from each keyword in the keyword set to form a keyword feature string set;
A2, constructing a plurality of simple Bloom filters, each comprising only one hash function that supports recursive computation, and mapping the keyword feature string set obtained in step A1 into all of the simple Bloom filters;
A3, constructing a hash table and mapping the keyword feature string set into the cells of the hash table, elements with colliding hash values being linked together as a chained list;
A4, building a linear list containing all original keywords, and storing the index number of the corresponding original keyword in each entry of the keyword feature string hash table established in step A3;
B) the pattern matching stage comprises the following steps:
B1, setting a text matching window of the same length as the keyword feature strings, and first left-aligning the text matching window with the text to be matched;
B2, taking the text string in the current text matching window as input, applying each simple Bloom filter and its hash function in turn to judge quickly that the current text string matches no keyword feature string: if any simple Bloom filter successfully excludes the current text string, jumping directly to step B5; if the exclusion judgment of the current simple Bloom filter fails, continuing with the next simple Bloom filter; if the exclusion judgment fails for all simple Bloom filters, proceeding to step B3;
B3, looking up the keyword feature string hash table with the text string in the text matching window; if a matching keyword feature string entry is found, executing step B4; if no matching entry is found, jumping directly to step B5;
B4, reading the corresponding original keyword from the original keyword linear list according to the index number in the keyword feature string entry, and performing a full-length string comparison with the text at the current matching window; if the match succeeds, reporting a successful keyword match event; finally, whether the match succeeded or failed, continuing with step B5;
B5, moving the current text matching window 1 byte to the right and jumping to step B2, until the whole text has been scanned.
2. The large-scale keyword matching method according to claim 1, characterized in that, in step A1 of the preprocessing stage A, for each keyword in the original keyword set, the extracted keyword feature string is the keyword substring that occurs the fewest times in the whole keyword set.
3. The large-scale keyword matching method according to claim 1, characterized in that each simple Bloom filter constructed in step A2 comprises only one hash function, and this hash function is built on the Rabin fingerprint polynomial-modulus algorithm, which supports recursive computation.
4. The large-scale keyword matching method according to claim 1, characterized in that, when the keyword feature string hash table is constructed in step A3, the hash function selected for the hash table is the hash function of the last Bloom filter among the plurality of simple Bloom filters constructed in step A2.
CN200710122231XA 2007-09-24 2007-09-24 Large scale key word matching method Expired - Fee Related CN101398820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200710122231XA CN101398820B (en) 2007-09-24 2007-09-24 Large scale key word matching method

Publications (2)

Publication Number Publication Date
CN101398820A CN101398820A (en) 2009-04-01
CN101398820B true CN101398820B (en) 2010-11-17

Family

ID=40517383

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200710122231XA Expired - Fee Related CN101398820B (en) 2007-09-24 2007-09-24 Large scale key word matching method

Country Status (1)

Country Link
CN (1) CN101398820B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20101117

Termination date: 20130924