CN103186669B

CN103186669B - Keyword fast filtering method

Info

Publication number: CN103186669B
Application number: CN201310092433.XA
Authority: CN
Inventors: 余小峰; 程行荣; 林贺辉; 时宜
Original assignee: Xiamen Yaxon Networks Co Ltd
Current assignee: Xiamen Yaxon Networks Co Ltd
Priority date: 2013-03-21
Filing date: 2013-03-21
Publication date: 2018-07-06
Anticipated expiration: 2033-03-21
Also published as: CN103186669A

Abstract

The present invention provides a fast keyword filtering method comprising a keyword indexing step and a keyword matching step. Each keyword in the keyword set is hash-coded, and keyword matching is then performed on the hash codes, which is more efficient than common multi-pattern keyword matching algorithms. Keyword matching is triggered by a last character, so that during text scanning it can be determined quickly that the current text matches no keyword; a keyword is matched by hash value only when both the first and the last characters are consistent, which greatly speeds up text scanning and matching.

Description

Keyword fast filtering method

Technical field

The present invention relates to a fast keyword filtering method.

Background art

In computer science, keyword matching is one of the earliest computer applications. The problem it solves is to determine quickly whether a data block contains one or more keywords from a keyword set. Keyword matching technology is widely used in text processing, network content analysis, intrusion detection, information retrieval, virus detection and other fields. In recent years, with the rapid development of network technology and the sharp growth in the number of users, the volume of data transmitted over networks has kept increasing and online language has become increasingly diverse. How to quickly block improper user speech and filter illegal vocabulary out of user-published content has become a major problem in the field of keyword matching.

In early network applications, keyword matching algorithms such as the automaton algorithm and the WM (Wu-Manber) algorithm emerged. The automaton algorithm is the most straightforward approach: each keyword is matched once against the text to be matched, so that in the worst case the text must be scanned and matched N times, where N is the number of keywords. The WM algorithm is another fast and effective multi-pattern matching algorithm, proposed by Sun Wu and Udi Manber in 1994. It combines multi-stage matching with hashing: three tables, SHIFT (shift table), HASH (hash table) and PREFIX (prefix table), are built in advance from the input pattern set, and together with the skipping idea of the BM (Boyer-Moore) algorithm they avoid a large number of unnecessary match tests. Thanks to these two techniques, the WM algorithm is highly efficient in practical applications.

The WM algorithm is an exact matching algorithm and is mainly suited to environments in which the text to be matched is written in a language of the Arabic or Latin families. In Chinese, the smallest unit of text is the individual Chinese character, characters can be combined freely into words, and the vocabulary is larger than that of any other language. This places more exacting requirements on a keyword matching algorithm, so existing keyword matching algorithms cannot reach satisfactory efficiency in practice; their performance is particularly poor for exact matching of multiple keywords.

Summary of the invention

The technical problem to be solved by the present invention is to provide a fast keyword filtering method that not only overcomes the low keyword matching efficiency of the traditional approach, but also solves the problem that existing multi-pattern keyword matching cannot work efficiently on Chinese text.

The present invention solves the above technical problem by the following technical solution: a fast keyword filtering method comprising a keyword indexing method and a keyword matching method.

The keyword indexing method comprises the following steps:

Step 10: Build a hash table H1 and store in it the hash value of each keyword in a preset keyword set;

Step 20: Build a hash table H2 and a hash table H3; store the hash value of the first character of each keyword in the keyword set in hash table H2, and store the hash value of the last character of each keyword in the keyword set in hash table H3;

Step 30: Find the keyword with the most characters in the keyword set and record its character length L;
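Steps 10 to 30 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names `keyword_hash` and `build_index` are hypothetical, Python sets stand in for the hash tables, and the hash of a single character is assumed to be its Unicode code point (an assumption that is consistent with the H2 and H3 values listed in the worked example of this document).

```python
def keyword_hash(word):
    """Shift-and-XOR hash, scanning the keyword from back to front."""
    h = 0
    for ch in reversed(word):
        # Assumption: the hash of a single character is its code point.
        h = (h << 4) ^ ord(ch)
    return h


def build_index(keywords):
    """Return (H1, H2, H3, L) for a non-empty collection of keywords."""
    h1 = {keyword_hash(k) for k in keywords}   # Step 10: keyword hashes
    h2 = {ord(k[0]) for k in keywords}         # Step 20: first-character hashes
    h3 = {ord(k[-1]) for k in keywords}        # Step 20: last-character hashes
    max_len = max(len(k) for k in keywords)    # Step 30: longest keyword length L
    return h1, h2, h3, max_len
```

Because sets are used, a first or last character shared by several keywords is recorded only once.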

The keyword matching method comprises the following steps:

Step 40: Construct a linear list whose length equals the character length L, obtaining the first-character list P_L;

Step 50: Scan one character C of the text to be matched, from front to back;

Step 60: Judge whether character C is one of the keyword first characters in hash table H2; if so, store character C and its position address in the text to be matched in the first-character list P_L, then perform Step 70; if not, perform Step 70;

Step 70: Judge whether character C is one of the keyword last characters in hash table H3; if so, perform Step 90; if not, perform Step 80;

Step 80: Continue scanning the text to be matched by one more character C and judge whether character C is empty; if so, end the flow; if not, perform Step 60;

Step 90: Extract one character B from the first-character list P_L, from back to front, and judge whether character B is empty; if so, perform Step 80; if not, perform Step 100;

Step 100: Calculate the difference I between the position addresses of character B and character C in the text to be matched, and judge whether I ≤ L; if so, perform Step 110; if not, empty the first-character list P_L and perform Step 80;

Step 110: Extract the character string S between character B and character C in the text to be matched, calculate the hash value Hs of the character string S, and judge whether Hs is one of the keyword hash values in hash table H1; if so, report one successful keyword matching event, empty the first-character list P_L, and return to Step 80; if not, go to Step 120;

Step 120: Continue extracting one more character B from the first-character list P_L in the same direction, and judge whether character B is empty; if so, perform Step 80; if not, perform Step 100.
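The matching flow of Steps 40 to 120 can be sketched as a single scan loop. The sketch below rests on several assumptions: the names `keyword_hash` and `find_keywords` are hypothetical, a character's hash is taken to be its Unicode code point, positions are 1-based, a `collections.deque` with `maxlen` plays the role of the fixed-length list P_L, and a hash hit is reported as a match without a further string comparison, as in Step 110.

```python
from collections import deque


def keyword_hash(word):
    """Back-to-front shift-and-XOR hash (character hash assumed to be the code point)."""
    h = 0
    for ch in reversed(word):
        h = (h << 4) ^ ord(ch)
    return h


def find_keywords(text, keywords):
    """Return (1-based start position, matched string) pairs."""
    h1 = {keyword_hash(k) for k in keywords}   # keyword hashes
    h2 = {ord(k[0]) for k in keywords}         # first-character hashes
    h3 = {ord(k[-1]) for k in keywords}        # last-character hashes
    max_len = max(len(k) for k in keywords)    # L

    p_l = deque(maxlen=max_len)                # Step 40: (character, position) pairs
    matches = []
    for pos, c in enumerate(text, start=1):    # Steps 50 and 80: scan front to back
        if ord(c) in h2:                       # Step 60: possible first character
            p_l.append((c, pos))
        if ord(c) not in h3:                   # Step 70: only a last character triggers matching
            continue
        for _, start in reversed(list(p_l)):   # Steps 90 and 120: most recent entry first
            if pos - start > max_len:          # Step 100: position difference I exceeds L
                p_l.clear()
                break
            s = text[start - 1:pos]            # Step 110: candidate string S
            if keyword_hash(s) in h1:
                matches.append((start, s))
                p_l.clear()
                break
    return matches
```

For instance, with the eight city names of the worked example as keywords and the short text 厦门和上海, this sketch reports 厦门 at position 1 and 上海 at position 4. Note that emptying P_L after a hit, as the steps prescribe, means overlapping keywords are not reported.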

Further, the hash value of each keyword in the keyword set in Step 10 is calculated as follows:

Step 1: Set the hash code H_D of the keyword to 0;

Step 2: Scan one character D of the keyword, from back to front;

Step 3: Judge whether character D is empty; if so, end the flow; if not, go to Step 4;

Step 4: Shift H_D left by four bits, then XOR the hash value of character D with H_D and take the result as the new value of H_D; continue to extract the next character D and repeat Step 3 and Step 4 until D is the last character of the keyword to be processed; the result obtained for that last character is the hash value of the keyword.
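Steps 1 to 4 amount to a short shift-and-XOR loop. In the sketch below the function name `keyword_hash` is hypothetical, and the hash of a single character is assumed to be its Unicode code point; under that assumption the function reproduces the H1 values listed in the worked example of this document, for example 593190 for 厦门 (Xiamen).

```python
def keyword_hash(word):
    """Hash a keyword by scanning it from back to front."""
    h = 0                        # Step 1: H_D starts at 0
    for ch in reversed(word):    # Step 2: take characters from back to front
        # Step 4: shift left by four bits, then XOR in the character's hash
        # (assumed here to be its code point).
        h = (h << 4) ^ ord(ch)
    return h                     # Step 3: no characters left, flow ends


print(keyword_hash("厦门"))  # prints 593190
```

Python integers are unbounded; an implementation with fixed-width integers would need to decide how to handle the bits shifted out for long keywords, which the text does not specify.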

The beneficial effects of the present invention are as follows: the invention implements sensitive-word filtering of text content, meets the efficiency requirement, and at the same time reduces the difficulty of development and implementation.

Detailed description of the embodiments

A fast keyword filtering method comprises a keyword indexing method and a keyword matching method.

The keyword indexing method comprises the following steps:

Step 20: Build a hash table H2 and a hash table H3; store the hash value of the first character of each keyword in the keyword set in hash table H2, and store the hash value of the last character of each keyword in the keyword set in hash table H3. When several keywords share the same first character, it is recorded only once; likewise, when several keywords share the same last character, it is recorded only once.

The keyword matching method comprises the following steps:

Step 40: Construct a linear list whose length equals the character length L, obtaining the first-character list P_L; when the number of characters stored in P_L exceeds the length L, the earliest-added character is removed.
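The bounded behaviour of P_L described in Step 40 can be illustrated with a fixed-capacity queue. This is only a sketch: the helper name `make_first_char_list` is hypothetical, and Python's `collections.deque` with `maxlen` is used as one convenient stand-in for the linear list, since it discards the oldest entry automatically when a new one is appended to a full queue.

```python
from collections import deque


def make_first_char_list(max_len):
    """Create an empty first-character list P_L holding at most max_len entries."""
    return deque(maxlen=max_len)


p_l = make_first_char_list(4)
for pos, ch in enumerate("abcde", start=1):
    p_l.append((ch, pos))    # the fifth append pushes out ('a', 1)

print(list(p_l))  # prints [('b', 2), ('c', 3), ('d', 4), ('e', 5)]
```

Capping the list at L entries is safe because a first character more than L positions behind the current character cannot start a keyword that ends at the current character.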

The hash value of each keyword in the keyword set is calculated as follows:

Step 1: Set the hash code H_D of the keyword to 0;

Step 2: Scan one character D of the keyword, from back to front;

For the constructed hash table to be well balanced, a hash mapping function F with good uniformity must be constructed for it, and F should also be computationally efficient so as to reduce the indexing overhead of the hash table. Hash value collisions between keywords are resolved in the present invention by chaining: keywords with the same hash value are linked into a linked list whose elements are sorted in ascending dictionary order, and the linked list is stored in hash table H1 as the value.
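The chaining scheme can be sketched as a mapping from hash value to a sorted chain of keywords. The sketch makes several assumptions: the names `build_h1` and `contains` are hypothetical, a Python list stands in for the linked list, "dictionary order" is taken to be Python's default string ordering (by code point), and a character's hash is taken to be its code point.

```python
import bisect


def keyword_hash(word):
    """Back-to-front shift-and-XOR hash (character hash assumed to be the code point)."""
    h = 0
    for ch in reversed(word):
        h = (h << 4) ^ ord(ch)
    return h


def build_h1(keywords):
    """Map each hash value to the chain of keywords sharing it, in ascending order."""
    h1 = {}
    for k in sorted(set(keywords)):           # sorted input keeps every chain sorted
        h1.setdefault(keyword_hash(k), []).append(k)
    return h1


def contains(h1, candidate):
    """Check the candidate against its chain, so a colliding non-keyword is rejected."""
    chain = h1.get(keyword_hash(candidate), [])
    i = bisect.bisect_left(chain, candidate)  # binary search works because the chain is sorted
    return i < len(chain) and chain[i] == candidate
```

Keeping each chain sorted allows the final string comparison to use a binary search instead of walking the whole chain.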

The hash value Hs of the character string S is calculated in the same way as the keyword hash values described above.

The present invention uses a keyword matching algorithm triggered by an end character. For a text to be matched ABCDE, assuming that A, B, C and D belong to the first-character set (hash table H2) and that E is an end character, the candidate strings ABCDE, BCDE, CDE, DE and E can be retrieved. If hash values are calculated from back to front, the hash value of DE can be calculated from the hash value of E and the character D, without processing character E again; constructing keyword hash values with this hash function is therefore more efficient.
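The reuse of the hash of E when hashing DE, CDE and so on can be sketched as a running hash that is extended by one character at a time toward the front of the text. The function name `suffix_hashes` is hypothetical, indices are 0-based here, and a character's hash is assumed to be its Unicode code point.

```python
def suffix_hashes(text, end):
    """Yield (start, hash of text[start:end + 1]) for start = end, end - 1, ..., 0.

    Each new hash is derived from the previous one with a single shift and XOR,
    so no character is processed twice.
    """
    h = 0
    for start in range(end, -1, -1):
        h = (h << 4) ^ ord(text[start])   # extend the hashed string by one character in front
        yield start, h


# Hashes of E, DE, CDE, BCDE and ABCDE, in that order.
for start, h in suffix_hashes("ABCDE", 4):
    print("ABCDE"[start:], h)
```

With this arrangement, hashing all candidates that end at one end character costs time proportional to the length of the longest candidate, rather than to the sum of all candidate lengths.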

For a better understanding of the present invention, an example is given below.

P is defined as the keyword set: P = [北京 (Beijing), 上海 (Shanghai), 厦门 (Xiamen), 呼和浩特 (Hohhot), 海口 (Haikou), 成都 (Chengdu), 宜昌 (Yichang), 盐城 (Yancheng)].

T is defined as the text to be matched: T = [Xiamen and Shanghai are both livable tourist cities]. T is a Chinese sentence, and the position addresses below count its Chinese characters starting from 1.

The pattern matching process according to the method of the present invention is as follows.

First, the hash tables are built:

H1 = [309719: 北京; 432506: 上海; 593190: 厦门; 122676668: 呼和浩特; 348999: 海口; 617920: 成都; 408412: 宜昌; 330288: 盐城];

H2 = [21271: 北; 19978: 上; 21414: 厦; 21628: 呼; 28023: 海; 25104: 成; 23452: 宜; 30416: 盐];

H3 = [20140: 京; 28023: 海; 38376: 门; 29305: 特; 21475: 口; 37117: 都; 26124: 昌; 22478: 城];

The longest keyword is [呼和浩特], so L = 4.

A linear list of length 4, the first-character list P_L, is constructed to store the first characters that have been matched.

Then the first character [厦] of the text T is scanned. [厦] belongs to the first-character set, i.e. hash table H2, and its position address in the text to be matched is 1; it is stored in P_L, giving P_L = [(厦, 1)].

The next character [门] is scanned. [门] belongs to the last-character set, i.e. hash table H3, so the character string [厦门] from [厦] to [门] is extracted; its length is 2, which is less than L. The hash value of [厦门] is calculated as 593190, and the keyword with the same hash value found in H1 is [厦门], so the match succeeds. P_L is emptied: P_L = [].

The next character [and] is scanned. [and] belongs to neither hash table H2 nor hash table H3, so the matching step is skipped.

The next character [上] is scanned. [上] is a keyword first character, so [上] and its position address 4 in the text to be matched are stored in P_L: P_L = [(上, 4)].

The next character [海] is scanned. First, [海] is found to be a keyword first character, so [海] and its position address 5 in the text to be matched are stored in P_L: P_L = [(上, 4); (海, 5)]. Next, [海] is found to belong to hash table H3. The character string [海] from the most recently scanned first character [海] to [海] is extracted and its hash value is calculated as 28023; no identical keyword is found in H1. The character string [上海] from the previously scanned first character [上] to [海] is then extracted and its hash value is calculated as 432506; the keyword with the same hash value found in H1 is [上海], so the match succeeds. P_L is emptied: P_L = [].

The next character [都] is scanned. [都] belongs to hash table H3, but P_L is empty at this point, so the matching step is skipped.

When the scan reaches [宜], [宜] is found to belong to hash table H2, so [宜] and its position address 8 in the text to be matched are stored in P_L: P_L = [(宜, 8)].

When the scan continues to [城], [城] is found to belong to hash table H3, and the character string [livable tourist city] from the last scanned first character [宜] to [城] is extracted. Since the length of this character string is 6, which is greater than L, the matching step is skipped directly.

This is repeated until the scan of the entire text to be matched is finished.

The present invention builds hash tables to eliminate the time wasted by searching for keywords through traversal and comparison; during search and matching, the invention determines quickly whether a match succeeds according to the hash value of the keyword.

The present invention hash-codes each keyword in the keyword set and then performs keyword matching on the hash codes, which is more efficient than common multi-pattern keyword matching algorithms. The present invention triggers keyword matching with a last character, so that during text scanning it can be determined quickly that the current text matches no keyword; a keyword is matched by hash value only when both the first and the last characters are consistent, which greatly speeds up text scanning and matching.

Claims

1. A fast keyword filtering method, characterized in that it comprises a keyword indexing method and a keyword matching method,

The keyword indexing method comprises the following steps:

Step 10: Build a hash table H1 and store in it the hash value of each keyword in a preset keyword set;

Step 20: Build a hash table H2 and a hash table H3; store the hash value of the first character of each keyword in the keyword set in hash table H2, and store the hash value of the last character of each keyword in the keyword set in hash table H3;

Step 30: Find the keyword with the most characters in the keyword set and record its character length L;

The keyword matching method comprises the following steps:

Step 60: Judge whether character C is one of the keyword first characters in hash table H2; if so, store character C and its position address in the text to be matched in the first-character list P_L, then perform Step 70; if not, perform Step 70;

Step 80: Continue scanning the text to be matched by one more character C and judge whether character C is empty; if so, end the flow; if not, perform Step 60;

Step 90: Extract one character B from the first-character list P_L, from back to front, and judge whether character B is empty; if so, perform Step 80; if not, perform Step 100;

Step 100: Calculate the difference I between the position addresses of character B and character C in the text to be matched, and judge whether I ≤ L; if so, perform Step 110; if not, empty the first-character list P_L and perform Step 80;

Step 110: Extract the character string S between character B and character C in the text to be matched, calculate the hash value Hs of the character string S, and judge whether Hs is one of the keyword hash values in hash table H1; if so, report one successful keyword matching event, empty the first-character list P_L, and return to Step 80; if not, go to Step 120;

Step 120: Continue extracting one more character B from the first-character list P_L, moving toward the front, and judge whether character B is empty; if so, perform Step 80; if not, perform Step 100.

2. The fast keyword filtering method according to claim 1, characterized in that the hash value of each keyword in the keyword set in Step 10 is calculated as follows:

Step 1: Set the hash code H_D of the keyword to 0;

Step 2: Scan one character D of the keyword, from back to front;

Step 4: Shift H_D left by four bits, then XOR the hash value of character D with H_D and take the result as the new value of H_D; continue to extract the next character D and repeat Step 3 and Step 4 until D is the last character of the keyword to be processed; the result obtained for that last character is the hash value of the keyword.