CN101030221A - Large-scale multi-keyword matching method for text or network content analysis

Info

Publication number: CN101030221A
Application number: CN 200710065392
Authority: CN
Inventors: 周宗伟; 薛一波
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2007-04-13
Filing date: 2007-04-13
Publication date: 2007-09-05
Anticipated expiration: 2027-04-13
Also published as: CN100452055C

Abstract

A large-scale multi-keyword matching method for text or network content analysis. The method builds a jump table and a keyword table, computes the jump value of each jump-table entry, and associates keywords with keyword-table entries. During matching, the data block in the window is hashed and the result indexes the jump table. If the jump value is nonzero, the window moves by that value; otherwise the data block is hashed again to index the keyword table, and each keyword associated with that entry is compared in turn with the corresponding field of the text to confirm whether it matches.

Description

A large-scale multi-keyword matching method for text or network content analysis

Technical field

The present invention relates to a large-scale multi-keyword matching method for text or network content analysis, and in particular to techniques for processing text or network content against large keyword sets. It belongs to the field of computer data processing.

Background technology

Multi-keyword matching is one of the basic problems of computer science. It asks whether a given text or data block contains one or more keywords from a given keyword set, and the answer must be both fast and accurate. Multi-keyword matching is now widely used across network security, including firewalls, virus detection, intrusion detection and prevention, and content filtering. It also extends to other disciplines, such as information management systems, web search engines, and gene sequence detection in bioinformatics. Research on and improvement of multi-keyword matching methods is therefore of real practical significance.

One of the classical approaches to the multi-keyword matching problem is based on hashed jump tables. It was first proposed in 1994 by Sun Wu of National Chung Cheng University, Taiwan, and Udi Manber of the University of Arizona, USA, and is usually named after its inventors; it is referred to below as the "WM method". The basic idea is to compute, during matching, the jump value of a data block of B bytes in the text (B is usually 2 or 3) and, according to that value, either move forward in the text or verify the keywords that may match. The WM method determines block jump values with a heuristic rule that avoids some unnecessary verification, and it builds the jump-value table by hashing, which speeds up matching.

Like multi-keyword matching methods in general, the WM method has two stages: preprocessing and matching. During preprocessing it builds three tables: a jump table, a hash table, and a prefix table. The jump table stores the jump value of each data block, which determines how many characters of the text can be skipped during scanning. When a jump value is 0, the hash table and prefix table are used to decide whether a match may occur at that position and which keywords of the set may be involved. For the keyword set {her, his, she, kiss} and block size B = 2, the corresponding jump table, hash table, and prefix table are shown in Figure 1. During matching, a window of length B is placed at the start of the text. The WM method computes the hash of the B characters in the current window and looks up the corresponding jump-table entry. If the jump value is nonzero, the window moves by that value. If it is zero, the same hash value is used to look up the hash table and find the keywords that may match; the prefix table filters out some of them, and the remaining keywords are compared character by character to decide whether a match occurs at this position.
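As a rough illustration of the prior-art tables described above, the following Python sketch builds a jump table and a hash table for the keywords her, his, she, and kiss with B = 2. It is a simplification under stated assumptions: the tables are plain dictionaries keyed by the block itself rather than by a real hash value, the prefix table is omitted, and the name `build_wm_tables` is hypothetical. The values it produces are not claimed to reproduce Figure 1 entry for entry.

```python
def build_wm_tables(keywords, B=2):
    """Build WM-style jump and hash tables (sketch; prefix table omitted)."""
    m = min(len(k) for k in keywords)      # only the first m characters are used
    default = m - B + 1                    # jump for blocks found in no prefix
    jump, candidates = {}, {}
    for kw in keywords:
        prefix = kw[:m]
        for j in range(m - B + 1):
            block = prefix[j:j + B]
            shift = m - B - j              # distance from this block to the prefix end
            jump[block] = min(jump.get(block, default), shift)
            if shift == 0:                 # block ends the prefix: possible match
                candidates.setdefault(block, []).append(kw)
    return m, default, jump, candidates


m, default, jump, candidates = build_wm_tables(["her", "his", "she", "kiss"])
```

In this toy run the block "he" ends up with jump value 0 because it closes the prefix of "she", even though it only opens "her"; the smaller value always wins so that no match can be skipped.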

The WM method has outstanding average-case performance: it matches quickly and uses little memory, and it is among the best-performing methods in the multi-keyword matching field in both time and space. It is therefore widely used, for example in the Linux command agrep and in the keyword matching module of the well-known open-source intrusion detection system Snort.

It should be noted that, with the rapid development of Internet technology, research on multi-keyword matching faces new challenges. Attacks grow more complex by the day, and the keyword sets used in network security applications keep growing with them. According to CNCERT/CC, the rule sets currently used for web content and sensitive-information filtering generally contain tens of thousands of entries, and some reach hundreds of thousands. The signature database of the well-known open-source antivirus software ClamAV has reached 49,644 entries and is still growing. Given this situation, the present invention defines a large keyword set as one containing more than 10,000 keywords. At the same time, as network bandwidth keeps increasing, the processing capacity of network security systems must keep improving, which means that multi-keyword matching, as a core technology, must keep getting faster as well. In fact, the performance of most network security systems, especially real-time systems such as virus detection, traffic statistics and analysis, and sensitive-information filtering, still cannot meet application demands. These two circumstances show the need for research on high-speed multi-keyword matching methods for large keyword sets.

However, analysis of the WM algorithm shows that as the number of keywords grows, the number of jump-table entries whose jump value is zero increases and the average jump value falls. A large number of unnecessary character comparisons can then no longer be avoided, and the matching of text or network content slows down. Experimental results show that when the number of keywords reaches 100,000, the matching speed of the WM method drops markedly, and the resulting throughput is too low to meet the practical requirements of text or network content processing.
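The growth of zero-valued entries can be estimated with a simple back-of-the-envelope model, sketched below in Python. The model is an assumption made only for illustration and does not come from the experiments cited above: each keyword is taken to zero one uniformly random entry of a table with 2^16 entries (as for 16-bit blocks with B = 2), and the function name is hypothetical.

```python
def expected_zero_fraction(num_keywords: int, table_size: int) -> float:
    """Expected share of jump-table entries set to zero under a uniform model."""
    # Probability that a given entry is missed by every keyword, then complemented.
    return 1.0 - (1.0 - 1.0 / table_size) ** num_keywords


# Share of zero entries for growing keyword sets (illustrative table size 2**16).
fractions = {k: expected_zero_fraction(k, 2 ** 16) for k in (1_000, 10_000, 100_000)}
```

Under this model a small share of entries is zero at 1,000 keywords, while most entries are zero at 100,000 keywords, which is consistent with the qualitative slowdown described above.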

Summary of the invention

The object of the invention is to provide a large-scale multi-keyword matching method for text or network content analysis. The method introduces, for the first time, the idea of dynamically cutting keywords, which fully exploits the heuristic information in the keyword set. It overcomes the WM method's weakness of having too many zero-valued jump-table entries under large keyword sets and significantly improves matching speed for such sets. The method also introduces a two-stage compressed hash table and removes the prefix table used by the WM method while preserving matching performance. This reduces memory consumption, ensures good hardware scalability, and increases the method's practical value.

The large-scale multi-keyword matching method for text or network content analysis comprises the following steps:

(1) Determine the length m of the shortest keyword in the keyword set, where m is a positive integer greater than or equal to 4, and build the jump table and the keyword table;

(2) Place a window of size m at the start of the first keyword of the keyword set. Perform the first hash operation on the data block formed by the last B characters in the window, where B is 3 or 4, and look up the jump table with the resulting hash value. If the jump value of the corresponding entry is nonzero, move the window by one character and repeat step (2); if it is zero, go to step (3);

(3) Perform the second hash operation on the data block formed by the last B characters in the window and look up the keyword table with the resulting hash value. If the keyword count of the corresponding entry is nonzero, move the window by one character and repeat step (2). If it is zero, the current window is the optimal m-window; record the offset P between the optimal m-window and the start of the keyword, where P equals the number of times the window has moved. If the window reaches the right end of the first keyword, designate the first m characters of the keyword as the optimal m-window and record an offset of zero;

(4) Repeat steps (2) and (3) to obtain the optimal m-window and its offset for each keyword;

(5) Using the optimal m-windows of all keywords in the keyword set and every data block of length B within each window, update the jump values in the jump table and the keyword table and associate each keyword with a keyword-table entry, yielding the analysis jump table and the analysis keyword table;

(6) Place a window of size B at the start of the text or network content to be analyzed. Perform the third hash operation on the data in the window and look up the analysis jump table with the resulting hash value. If the jump value of the corresponding entry is nonzero, move the window by that jump value. If the jump value is zero, perform the fourth hash operation on the data block and look up the analysis keyword table with the resulting hash value. If the corresponding entry is associated with one or more keywords that may match, compare each keyword in turn, character by character, with the corresponding field of the text or network content, then move the window by the jump value stored in that keyword-table entry. Continue until the window reaches the end of the text or network content to be analyzed.

In the above method, the process of comparing each keyword in turn, character by character, with the corresponding field of the text or network content to be analyzed comprises the following steps:

(1) Let T be the distance between the first character of the current B-window and the start of the text or network content to be analyzed. The field to be compared starts at distance T-L from the start of the text or network content, where L = m-B+P and P is the offset of the optimal m-window of the keyword being compared;

(2) Align the first character of the keyword with the start of the field to be compared and compare them character by character. If they match, output the match position T-L and the number of the keyword, then go to step (3); if they do not match, go directly to step (3);

(3) Repeat steps (1) and (2) until all associated keywords have been compared.
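The alignment in these verification steps can be sketched in Python as follows. The sketch uses the adjustment L = m - B + P given for the matching stage in the embodiment, and it borrows the example of the keyword "cooperation" with optimal m-window "peration" (m = 8, B = 4, P = 3). The function name `verify_at` and the sample text are hypothetical.

```python
def verify_at(text: str, T: int, keyword: str, P: int, m: int, B: int) -> int:
    """Return the match start offset T - L, or -1 if the keyword does not match."""
    L = m - B + P                 # distance from the keyword start to the B-window
    start = T - L
    if start < 0:                 # the keyword would begin before the text does
        return -1
    return start if text.startswith(keyword, start) else -1


text = "international cooperation"
T = text.index("tion", 14)        # B-window on the last block of "peration"
start = verify_at(text, T, "cooperation", P=3, m=8, B=4)
```

The same block "tion" also occurs inside "international"; calling `verify_at` there yields -1, which is the case where the character-by-character comparison rejects a candidate.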

The large-scale multi-keyword matching method for text or network content analysis proposed by the present invention has the following advantages:

1. Excellent measured performance: in experiments, the high-speed large-scale multi-keyword matching method proposed by the present invention outperforms the existing WM method in both time and space. With keyword counts from 10,000 to 100,000, its matching speed is 2 to 4 times that of the WM method. Its memory use is roughly level with that of the WM method at B = 2 and 75% to 90% lower than that of the WM method at B = 3.

2. Meets practical requirements: in practical use, with a keyword count of 100,000, the proposed method reaches a throughput of 600 Mbps (that is, it matches 75 MB of text or network content per second) while occupying only about 5 MB of memory. This satisfies the performance requirements of most current text or network content processing.

3. Good hardware scalability: the low memory consumption makes it feasible to load the proposed method into a hardware acceleration system or to develop it into a dedicated system-on-chip, further improving matching speed, and gives the method good market prospects.

Description of drawings

Fig. 1 shows the jump table, hash table, and prefix table of the existing WM method for the example keyword set {her, his, she, kiss}.

Fig. 2 shows the jump table and keyword table of the present method before the optimal m-window position of the keyword "cooperation" is set.

Fig. 3 is a flow chart of setting the optimal m-window position of the keyword "cooperation".

Fig. 4 shows the jump table and keyword table after the optimal m-window position of the keyword "cooperation" is set.

Fig. 5 shows the jump table and keyword table before the jump values of the length-B data blocks in the optimal m-window of the keyword "cooperation" are set.

Fig. 6 is a flow chart of setting the jump values of the length-B data blocks in the optimal m-window of the keyword "cooperation".

Fig. 7 shows the jump table and keyword table after the jump values of the length-B data blocks in the optimal m-window of the keyword "cooperation" are set.

Embodiment

The large-scale multi-keyword matching method for text or network content analysis first determines the length m of the shortest keyword in the keyword set, where m is a positive integer greater than or equal to 4, and builds the jump table and the keyword table. A window of size m is placed at the start of the first keyword of the keyword set. The first hash operation is performed on the data block formed by the last B characters in the window, where B is 3 or 4, and the jump table is looked up with the resulting hash value. If the jump value of the corresponding entry is nonzero, the window is moved by one character and this step is repeated. If it is zero, the second hash operation is performed on the data block formed by the last B characters in the window, and the keyword table is looked up with the resulting hash value. If the keyword count of the corresponding entry is nonzero, the window is moved by one character and the preceding steps are repeated. If it is zero, the current window is the optimal m-window, and the offset P between the optimal m-window and the start of the keyword is recorded; P equals the number of times the window has moved. If the window reaches the right end of the first keyword, the first m characters of the keyword are designated the optimal m-window and an offset of zero is recorded. These steps are repeated to obtain the optimal m-window and its offset for each keyword. Using the optimal m-windows of all keywords in the keyword set and every data block of length B within each window, the jump values in the jump table and keyword table are updated and each keyword is associated with a keyword-table entry, yielding the analysis jump table and the analysis keyword table. A window of size B is then placed at the start of the text or network content to be analyzed. The third hash operation is performed on the data in the window, and the analysis jump table is looked up with the resulting hash value. If the jump value of the corresponding entry is nonzero, the window is moved by that jump value. If the jump value is zero, the fourth hash operation is performed on the data block and the analysis keyword table is looked up with the resulting hash value. If the corresponding entry is associated with one or more keywords that may match, each keyword is compared in turn, character by character, with the corresponding field of the text or network content, and the window is then moved by the jump value stored in that keyword-table entry. This continues until the window reaches the end of the text or network content to be analyzed.

In the above method, each keyword is compared in turn with the corresponding field of the text or network content as follows. Let T be the distance between the first character of the current B-window and the start of the text or network content to be analyzed. First, the start of the field to be compared is obtained as distance T-L from the start of the text or network content, where L = m-B+P and P is the offset of the optimal m-window of the keyword being compared. The first character of the keyword is then aligned with the start of the field and the two are compared character by character. If they match, the match position T-L and the number of the keyword are output. In either case the process moves on to the next keyword, and these steps are repeated until all associated keywords have been compared.

Specifically, compared with the existing WM method, the present method makes three major improvements to the key technology:

(1) The present invention uses a two-stage compressed hash table to optimize the jump table of the WM method. First, 20 bits are taken from the data block of B bytes (B may be 3 or 4) to build the first compressed hash table. This table is also called the jump table below; unless stated otherwise, "jump table" refers to the jump table of the present invention. Then, for data blocks whose jump value in the jump table is zero, a second hash function is introduced that takes another 17 bits from the data block to build the second compressed hash table, called the keyword table. Unlike the hash table of the WM method, each entry of this table holds not only the keywords that may match but also a jump value. The reason is that the jump table of this method has a fairly high hash-collision probability: many data blocks with a positive jump value are mapped by the hash operation to entries whose jump value is zero, and the second hash of the keyword table separates these blocks from the blocks whose jump value really is zero. Under large keyword sets, the two-stage compressed hash table effectively reduces the number of zero-valued entries and raises the average jump value during matching of text or network content, which speeds up multi-keyword matching. At the same time, compression and two-level organization reduce memory consumption.
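A minimal Python sketch of the two-stage idea follows. The text above fixes only the index widths (20 and 17 bits); which bits each hash takes is not specified, so the two functions below are hypothetical choices made for illustration, as are the names `first_hash` and `second_hash` and the one-byte-per-entry table layout.

```python
SKIP_BITS = 20   # index width of the first compressed table (jump table)
KEY_BITS = 17    # index width of the second compressed table (keyword table)


def first_hash(block: bytes) -> int:
    """Hypothetical first hash: the low 20 bits of the block."""
    return int.from_bytes(block, "big") & ((1 << SKIP_BITS) - 1)


def second_hash(block: bytes) -> int:
    """Hypothetical second hash: 17 bits taken from a different bit range."""
    return (int.from_bytes(block, "big") >> 7) & ((1 << KEY_BITS) - 1)


m, B = 8, 4
# Every entry starts at the default jump value m - B + 1, one byte per entry.
jump_table = bytearray([m - B + 1]) * (1 << SKIP_BITS)
key_jump = bytearray([m - B + 1]) * (1 << KEY_BITS)

# Two different blocks that collide in the first table but not in the second.
a, b = b"tion", b"\x00\x09on"
collide_first = first_hash(a) == first_hash(b)
collide_second = second_hash(a) == second_hash(b)
```

With these choices the two blocks share a jump-table entry, so a zero jump value set for one would also be read for the other, while the second hash tells them apart. The per-entry keyword lists of the keyword table are left out of this sketch.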

(2) Dynamic cutting improves matching speed by mining the information contained in the keyword set. When the shortest keyword in the set has length m, the original WM method simply considers the first m characters of each keyword and does not preprocess the remaining characters, so it may miss information in the keyword set that would help matching speed. Based on this observation, the present invention preprocesses the keyword set by dynamic cutting: it examines every data block of length m in each keyword and uses heuristic rules to select the best combination of these blocks, one block per keyword. After study and analysis, the method adopts two heuristic rules. First, the number of jump-table entries with a zero jump value should be as small as possible. Second, the average number of keywords associated with each keyword-table entry should be as small as possible.

(3) The present invention removes the prefix table of the WM method without affecting matching speed, further reducing memory consumption. Because of heuristic rule two of the dynamic cutting described above, the average number of keywords associated with each keyword-table entry falls, so fewer keywords need character-by-character confirmation each time a keyword-table entry is hit. Weighing the prefix table's contribution to matching speed against the space it requires, the method can dispense with it.

The invention is described in detail below with reference to the accompanying drawings. The method comprises two stages: preprocessing of the keyword set and matching of the text or network content.

In the preprocessing stage, the whole keyword set is first scanned to determine the length m of the shortest keyword, where m is greater than or equal to 4, and the block size B is set to 3 or 4. The jump values of all entries in the jump table and the keyword table are initialized to m-B+1, and the keyword count of every keyword-table entry is set to zero.

The keywords are then cut dynamically according to the two heuristic rules above, determining for each keyword in turn the position of the block of m consecutive characters that is best for matching speed (called the optimal m-window). The procedure is illustrated by setting the optimal m-window position of the keyword "cooperation". Figure 2 shows the jump table and keyword table beforehand. The minimum keyword length m of the keyword set is 8, and the block size B is 4. Note that in the jump table "～ti" denotes data blocks whose last two bytes (16 bits) are "ti", with "～" standing for a further 4 bits; that is, for every data block of length B ending in these 20 bits, the hash value obtained by the first hash operation is the index of this jump-table entry. Similarly, in the keyword table "?ti" denotes the data blocks of length B ending in the given 17 bits, for which the hash value obtained by the second hash operation is the index of this keyword-table entry.

Figure 3 shows the flow for determining the optimal m-window of "cooperation". A window of length m starts at the left end of the keyword. The data block formed by the last B characters in the window undergoes the first hash operation, and the resulting hash value is used to look up the jump table. If the jump value of the corresponding jump-table entry is nonzero, the window is moved one byte to the right and the process is repeated. If it is zero, the data block undergoes the second hash operation and the resulting hash value is used to look up the keyword table. If the keyword count of the corresponding keyword-table entry is zero, the current window is the optimal m-window, and the offset P between the current window and the start of the keyword is recorded. If the keyword count is nonzero, the window is moved one byte to the right and the process is repeated. If the window reaches the end of the keyword without an optimal m-window having been found under these conditions, the first m characters of the keyword are designated the optimal m-window and the offset P is recorded as zero. The offset P is used in the matching stage to compute the adjustment L that locates the start of the corresponding text field. Figure 4 shows the jump table and keyword table after the optimal m-window of the keyword "cooperation" has been determined. As can be seen, once the optimal m-window is determined, the jump values in the jump-table entry and the keyword-table entry corresponding to the data block formed by the last B characters of the window are set to zero, and the keyword is associated with the corresponding keyword-table entry.
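The window-selection flow described above can be sketched in Python as follows. The tables are modelled as dictionaries, and the two hash functions are passed in as parameters. The toy hashes used in the example (the last two characters for the first level, the whole block for the second) are assumptions chosen only so that a first-level collision is easy to see; they are not the hash functions of the method.

```python
def choose_window(keyword, m, B, jump, key_count, h1, h2):
    """Return offset P of the optimal m-window, or 0 if no window qualifies."""
    for P in range(len(keyword) - m + 1):
        block = keyword[P + m - B:P + m]          # last B characters of the window
        if jump.get(h1(block), m - B + 1) != 0:   # would create a new zero entry
            continue
        if key_count.get(h2(block), 0) == 0:      # keyword-table entry still empty
            return P
    return 0                                      # fall back to the first m characters


def coarse_hash(block):      # toy first hash: collides on the last two characters
    return block[-2:]


def fine_hash(block):        # toy second hash: distinguishes whole blocks
    return block


# State left by a hypothetical earlier keyword whose window ended in "pion".
jump = {"on": 0}
key_count = {"pion": 1}
P = choose_window("cooperation", 8, 4, jump, key_count, coarse_hash, fine_hash)
```

In this toy state the window "peration" (P = 3) is chosen: its last block "tion" lands on a jump-table entry that is already zero, so no new zero entry is created, and its keyword-table entry is still empty.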

After the optimal m-windows of all keywords have been determined, the other data blocks of length B within the optimal m-window of each keyword must be processed one by one, updating their jump values in the corresponding jump-table and keyword-table entries. Figure 5 shows the jump table and keyword table before the jump values of the other data blocks in the optimal m-window of the keyword "cooperation" are set, and Figure 6 shows the flow for setting them. In these figures the underlined part "peration" of "cooperation" marks the optimal m-window of the keyword. The method starts a block window of size B at the left end of the optimal m-window, performs the first hash operation on the data in the B-window, looks up the jump table with the resulting hash value, and updates the jump value of the corresponding jump-table entry. The jump value is computed as follows: if the offset between this B-byte data block and the start of the optimal m-window is j, the jump value of the block is m-B-j, and the jump-table entry takes the smaller of this value and the value already stored in the entry. If the jump value of the entry is zero, the second hash operation is performed on the same B-byte data block, the keyword table is looked up with the resulting hash value, and the jump value of the corresponding keyword-table entry is updated in the same way as the jump-table value above. Figure 7 shows the jump table and keyword table after the jump values of the other data blocks in the optimal m-window of the keyword "cooperation" have been determined. Once all keywords have been processed in this way, the complete jump table and keyword table have been generated and the keyword preprocessing stage ends.
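For the example window "peration" with B = 4, the jump values m-B-j can be listed with a few lines of Python. The sketch keys the table by the block itself instead of a hash value and covers only the first-level table; both simplifications, and the function names, are assumptions made for illustration.

```python
def window_jumps(window: str, B: int):
    """List (block, jump) pairs for every length-B block of an optimal m-window."""
    m = len(window)
    return [(window[j:j + B], m - B - j) for j in range(m - B + 1)]


def apply_jumps(pairs, jump, default):
    """Keep the smaller of the stored and the newly computed jump for each block."""
    for block, value in pairs:
        jump[block] = min(jump.get(block, default), value)


pairs = window_jumps("peration", B=4)

jump = {"erat": 1}                    # value left behind by some earlier keyword
apply_jumps(pairs, jump, default=5)   # default is m - B + 1 = 5
```

The block "erat" keeps its smaller stored value, while the last block "tion" receives jump value zero, which is what later triggers the keyword-table lookup during matching.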

In the matching stage, the present invention first places a window of size B at the start of the text or network content and performs the third hash operation on the data block in the window; the third hash operation uses the same hash function as the first hash operation of the preprocessing stage. The jump table is looked up with the resulting hash value. If the jump value of the corresponding jump-table entry is nonzero, the window is moved by that jump value and the matching process is repeated. If the jump value is zero, the fourth hash operation is performed on the data block; the fourth hash operation uses the same hash function as the second hash operation of the preprocessing stage. The keyword table is looked up with the resulting hash value. If the entry is associated with one or more keywords that may match, these keywords are matched in turn against the corresponding field of the text. For each keyword the adjustment L is computed first, L = m-B+P, where P is the offset of the optimal m-window of the keyword currently being compared. If the distance between the current B-window and the start of the text is T, the first character of the keyword is aligned with offset T-L in the text and the two are compared character by character. On a match, the match position, namely offset T-L, and the number of the matching keyword are output. After all associated keywords have been compared, the B-window is moved by the jump value of this keyword-table entry. Matching ends when the B-window reaches the end of the text.
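Putting preprocessing and matching together, the following Python sketch runs the whole flow on a small example. It is a sketch under stated assumptions rather than the patented implementation: the tables are dictionaries with a default value instead of fixed-size arrays, the two hash functions are hypothetical 20-bit and 17-bit extractions, and the window advances by at least one position when the keyword-table jump value is zero, a detail the description leaves open.

```python
def h1(block: bytes) -> int:            # hypothetical first/third hash (20 bits)
    return int.from_bytes(block, "big") & 0xFFFFF


def h2(block: bytes) -> int:            # hypothetical second/fourth hash (17 bits)
    return (int.from_bytes(block, "big") >> 7) & 0x1FFFF


def preprocess(keywords, B=4):
    m = min(len(k) for k in keywords)
    default = m - B + 1
    jump, key_jump, key_words, chosen = {}, {}, {}, []
    for kw in keywords:                               # dynamic cutting
        P = 0                                         # fallback: first m characters
        for p in range(len(kw) - m + 1):
            last = kw[p + m - B:p + m]
            if jump.get(h1(last), default) == 0 and not key_words.get(h2(last)):
                P = p
                break
        last = kw[P + m - B:P + m]
        jump[h1(last)] = 0                            # zero both entries and
        key_jump[h2(last)] = 0                        # associate the keyword
        key_words.setdefault(h2(last), []).append((kw, P))
        chosen.append((kw, P))
    for kw, P in chosen:                              # remaining blocks of each window
        for j in range(m - B):
            block = kw[P + j:P + j + B]
            i1 = h1(block)
            jump[i1] = min(jump.get(i1, default), m - B - j)
            if jump[i1] == 0:                         # collided with a zero entry
                i2 = h2(block)
                key_jump[i2] = min(key_jump.get(i2, default), m - B - j)
    return m, default, jump, key_jump, key_words


def find_all(text: bytes, keywords, B=4):
    m, default, jump, key_jump, key_words = preprocess(keywords, B)
    matches, T = [], 0
    while T + B <= len(text):
        block = text[T:T + B]
        step = jump.get(h1(block), default)
        if step == 0:
            i2 = h2(block)
            for kw, P in key_words.get(i2, []):
                start = T - (m - B + P)               # T - L with L = m - B + P
                if start >= 0 and text.startswith(kw, start):
                    matches.append((start, kw))
            step = max(1, key_jump.get(i2, default))  # assumption: advance >= 1
        T += step
    return matches


kws = [b"cooperation", b"operation", b"national", b"rational"]
text = b"international cooperation needs rational operations"
found = sorted(find_all(text, kws))
```

A convenient sanity check for a sketch like this is to compare `found` against a brute-force scan that tests every keyword at every text position; the two lists should be identical.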

Claims

1. A large-scale multi-keyword matching method for text or network content analysis, characterized in that the method comprises the following steps:

2. The method of claim 1, characterized in that the process of comparing each keyword in turn, character by character, with the corresponding field of the text or network content to be analyzed comprises the following steps:

(3) Repeat steps (1) and (2) until all associated keywords have been compared.