CN103186669B - Keyword fast filtering method - Google Patents
- Publication number
- CN103186669B (application CN201310092433.XA; earlier publication CN103186669A)
- Authority
- CN
- China
- Prior art keywords
- character
- keyword
- hash
- hash table
- initial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a fast keyword filtering method comprising a keyword indexing step and a keyword matching step. Each keyword in the keyword set is hash-coded, and matching is performed on the hash codes, which is more efficient than common multi-pattern keyword matching algorithms. Matching is triggered by end characters, so that during a text scan the method can quickly determine that the current text matches no keyword; a keyword is looked up by hash value only when both its first and last characters agree with the text, which greatly speeds up text scanning and matching.
Description
Technical field
The present invention relates to a fast keyword filtering method.
Background technology
In computer science, keyword matching is one of the earliest computer applications. The problem it solves is to determine quickly whether a given block of data contains one or more keywords from a keyword set. Keyword matching is widely used in text processing, network content analysis, intrusion detection, information retrieval, and virus detection. In recent years, with the rapid development of network technology and the sharp growth in the number of users, the volume of data transmitted over networks keeps increasing and online language is becoming ever more diverse. Quickly screening out improper user speech and filtering illegal vocabulary from user-submitted content has become a major challenge in the keyword matching field.
Early network applications gave rise to keyword matching algorithms such as the automaton algorithm and the WM (Wu-Manber) algorithm. The automaton algorithm is the most straightforward approach: each keyword is matched against the text in turn, so in the worst case the text must be scanned and matched N times, where N is the number of keywords. The WM algorithm is a fast and effective multi-pattern matching algorithm proposed by Sun Wu and Udi Manber in 1994. It combines multi-stage matching with hashing: three tables built in advance from the input pattern set, SHIFT (shift table), HASH (hash table), and PREFIX (prefix table), together with the skipping idea of the BM (Boyer-Moore) algorithm, avoid a large number of unnecessary match tests. With these two techniques, the WM algorithm is highly efficient in practice.
The WM algorithm is an exact matching algorithm and is mainly suited to text in languages such as those of the Arabic and Romance families. In Chinese, the smallest unit of text is a single character, characters can be combined freely into words, and the resulting vocabulary is larger than that of any other language. This places stricter demands on a keyword matching algorithm, so existing algorithms do not reach satisfactory efficiency in practice; their performance is especially poor for exact matching of multiple keywords.
Summary of the invention
The technical problem to be solved by the invention is to provide a fast keyword filtering method that both overcomes the low keyword matching efficiency of traditional approaches and solves the inability of existing multi-pattern keyword matching to work efficiently on Chinese text.
The invention solves this problem with the following technical solution: a fast keyword filtering method comprising a keyword indexing method and a keyword matching method.
The keyword indexing method comprises the following steps:
Step 10: Build a hash table H1 and store in it the hash value of each keyword in the preset keyword set.
Step 20: Build hash tables H2 and H3. Store the hash value of the first character of each keyword in H2, and the hash value of the last character of each keyword in H3.
Step 30: Find the keyword with the most characters in the keyword set and record its character length L.
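The indexing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `keyword_hash` and `build_index` are hypothetical, the Unicode code point (`ord`) is assumed to serve as the per-character hash, and H1 is modelled as a dictionary from hash value to the keywords that share it.

```python
from typing import Dict, List, Set, Tuple


def keyword_hash(word: str) -> int:
    # Assumption: ord() (the Unicode code point) is the per-character hash.
    h = 0
    for ch in reversed(word):          # scan from back to front
        h = (h << 4) ^ ord(ch)         # shift left four bits, then XOR
    return h


def build_index(keywords: List[str]) -> Tuple[Dict[int, List[str]], Set[int], Set[int], int]:
    """Return (H1, H2, H3, L) for a keyword set, following steps 10 to 30."""
    h1: Dict[int, List[str]] = {}      # step 10: keyword hash -> keywords
    h2: Set[int] = set()               # step 20: first-character hashes
    h3: Set[int] = set()               # step 20: last-character hashes
    max_len = 0                        # step 30: longest keyword length L
    for kw in keywords:
        if not kw:                     # ignore empty keywords
            continue
        h1.setdefault(keyword_hash(kw), []).append(kw)
        h2.add(ord(kw[0]))             # a set records each character once
        h3.add(ord(kw[-1]))
        max_len = max(max_len, len(kw))
    return h1, h2, h3, max_len
```

For the keywords 北京 (Beijing), 上海 (Shanghai), 厦门 (Xiamen), and 呼和浩特 (Hohhot), `build_index` yields L = 4, with the first character 厦 (code point 21414) in H2 and the last character 门 (code point 38376) in H3.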
The keyword matching method comprises the following steps:
Step 40: Construct a linear list whose length equals the character length L; this is the first-character list PL.
Step 50: Scan one character C from the text to be matched, moving from front to back.
Step 60: Determine whether C is the first character of any keyword, that is, whether it is in hash table H2. If so, store C together with its position in the text in PL. In either case, go to step 70.
Step 70: Determine whether C is the last character of any keyword, that is, whether it is in hash table H3. If so, go to step 90; otherwise go to step 80.
Step 80: Scan the next character C of the text. If C is empty (the text is exhausted), end the procedure; otherwise go to step 60.
Step 90: Take one character B from PL, reading from back to front (most recently stored first). If B is empty, go to step 80; otherwise go to step 100.
Step 100: Compute the difference I between the positions of B and C in the text, and test whether I ≤ L. If so, go to step 110; otherwise empty PL and go to step 80.
Step 110: Extract the string S spanning B through C in the text and compute its hash value Hs. If Hs is the hash value of a keyword in H1, report a successful keyword match, empty PL, and return to step 80; otherwise go to step 120.
Step 120: Take the next character B from PL, continuing toward the front. If B is empty, go to step 80; otherwise go to step 100.
Further, the hash value of each keyword in the keyword set in step 10 is computed as follows:
Step 1: Set the hash code HD of the keyword to 0.
Step 2: Scan one character D of the keyword, moving from back to front.
Step 3: If D is empty, end the procedure; otherwise go to step 4.
Step 4: Shift HD left by four bits, then XOR it with the hash value of character D, and take the result as the new value of HD. Fetch the next character D and repeat steps 3 and 4 until the last remaining character of the keyword (its first character, since the scan runs from back to front) has been processed; the result after that character is the hash value of the keyword.
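A minimal sketch of steps 1 to 4 follows. The text does not say how the hash value of a single character is obtained; here it is assumed to be the Unicode code point (`ord`), a choice that reproduces the hash values listed in the worked example of the description. The `char_hash` parameter is a hypothetical hook that makes this assumption replaceable.

```python
from typing import Callable


def keyword_hash(keyword: str, char_hash: Callable[[str], int] = ord) -> int:
    """Shift-and-XOR hash computed from the last character to the first."""
    hd = 0                                        # Step 1: HD starts at 0
    i = len(keyword) - 1                          # Step 2: begin at the last character
    while i >= 0:                                 # Step 3: stop when no character remains
        hd = (hd << 4) ^ char_hash(keyword[i])    # Step 4: shift left 4 bits, XOR
        i -= 1
    return hd
```

For 厦门 (Xiamen), the last character 门 (38376) is processed first, then 厦 (21414): (38376 << 4) XOR 21414 = 593190. Python integers are unbounded, so nothing is truncated here; a fixed-width implementation would wrap around for long keywords and produce different values.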
The beneficial effect of the invention is that it implements sensitive-word filtering of text content in a way that meets efficiency requirements while also reducing the difficulty of development and implementation.
Detailed description of embodiments
A fast keyword filtering method comprises a keyword indexing method and a keyword matching method.
The keyword indexing method comprises the following steps:
Step 10: Build a hash table H1 and store in it the hash value of each keyword in the preset keyword set.
Step 20: Build hash tables H2 and H3. Store the hash value of the first character of each keyword in H2, and the hash value of the last character of each keyword in H3. If several keywords share the same first character, it is recorded only once; likewise, a last character shared by several keywords is recorded only once.
Step 30: Find the keyword with the most characters in the keyword set and record its character length L.
The keyword matching method comprises the following steps:
Step 40: Construct a linear list whose length equals the character length L; this is the first-character list PL. When the number of characters stored in PL would exceed L, the earliest-added character is removed.
Step 50: Scan one character C from the text to be matched, moving from front to back.
Step 60: Determine whether C is the first character of any keyword, that is, whether it is in hash table H2. If so, store C together with its position in the text in PL. In either case, go to step 70.
Step 70: Determine whether C is the last character of any keyword, that is, whether it is in hash table H3. If so, go to step 90; otherwise go to step 80.
Step 80: Scan the next character C of the text. If C is empty (the text is exhausted), end the procedure; otherwise go to step 60.
Step 90: Take one character B from PL, reading from back to front (most recently stored first). If B is empty, go to step 80; otherwise go to step 100.
Step 100: Compute the difference I between the positions of B and C in the text, and test whether I ≤ L. If so, go to step 110; otherwise empty PL and go to step 80.
Step 110: Extract the string S spanning B through C in the text and compute its hash value Hs. If Hs is the hash value of a keyword in H1, report a successful keyword match, empty PL, and return to step 80; otherwise go to step 120.
Step 120: Take the next character B from PL, continuing toward the front. If B is empty, go to step 80; otherwise go to step 100.
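The matching loop of steps 40 to 120 might be sketched as below. Several details are assumptions rather than statements of the patent: the names `keyword_hash` and `find_keywords` are hypothetical, `ord` stands in for the per-character hash, PL is modelled with `collections.deque(maxlen=L)` so that the oldest entry is dropped automatically, the test in step 100 is applied to the candidate string length (as the worked example does), and a hash hit is confirmed by comparing the string itself so that collisions cannot produce false matches.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def keyword_hash(s: str) -> int:
    h = 0
    for ch in reversed(s):                  # back to front, shift 4 bits and XOR
        h = (h << 4) ^ ord(ch)
    return h


def find_keywords(text: str, keywords: List[str]) -> List[Tuple[int, str]]:
    """Return (1-based start position, keyword) for each reported match."""
    h1: Dict[int, Set[str]] = {}
    for kw in keywords:                     # steps 10-30: build the index
        h1.setdefault(keyword_hash(kw), set()).add(kw)
    h2 = {kw[0] for kw in keywords}         # first characters
    h3 = {kw[-1] for kw in keywords}        # last characters
    max_len = max(len(kw) for kw in keywords)

    pl = deque(maxlen=max_len)              # step 40: bounded first-character list
    matches: List[Tuple[int, str]] = []
    for pos, c in enumerate(text, start=1): # steps 50 and 80: scan forward
        if c in h2:                         # step 60: remember a possible start
            pl.append((c, pos))
        if c not in h3:                     # step 70: only end characters trigger
            continue
        for _, start in reversed(tuple(pl)):    # steps 90 and 120: newest first
            if pos - start + 1 > max_len:       # step 100: candidate too long
                pl.clear()
                break
            s = text[start - 1:pos]             # step 110: candidate string
            if s in h1.get(keyword_hash(s), ()):
                matches.append((start, s))
                pl.clear()
                break
    return matches
```

Run on the text 厦门和上海都是宜居的旅游城市 (an assumed back-translation of the example sentence "Xiamen and Shanghai are both livable tourist cities") with the eight city names of the worked example, this sketch reports 厦门 at position 1 and 上海 at position 4. Because PL is emptied after every reported match, a keyword that overlaps a match just reported can be missed; that behaviour follows the steps as written.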
The hash value of each keyword in the keyword set is computed as follows:
Step 1: Set the hash code HD of the keyword to 0.
Step 2: Scan one character D of the keyword, moving from back to front.
Step 3: If D is empty, end the procedure; otherwise go to step 4.
Step 4: Shift HD left by four bits, then XOR it with the hash value of character D, and take the result as the new value of HD. Fetch the next character D and repeat steps 3 and 4 until the last remaining character of the keyword (its first character, since the scan runs from back to front) has been processed; the result after that character is the hash value of the keyword.
To keep the constructed hash table well balanced, a hash mapping function F with good uniformity is needed, and F should also be cheap to compute so as to reduce the indexing overhead of the hash table. Hash collisions between keywords are resolved by chaining: keywords with the same hash value are linked into a list whose elements are sorted in ascending lexicographic order, and that list is stored in hash table H1 as the value.
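The chaining scheme described above can be illustrated with a short sketch. The names `build_chained_table` and `lookup` are hypothetical, `ord` is assumed as the per-character hash, and binary search over the sorted chain is one possible way to exploit the ascending order; the text only states that the chain is sorted.

```python
import bisect
from typing import Dict, List


def keyword_hash(s: str) -> int:
    h = 0
    for ch in reversed(s):                 # back to front, shift 4 bits and XOR
        h = (h << 4) ^ ord(ch)
    return h


def build_chained_table(keywords: List[str]) -> Dict[int, List[str]]:
    """Map each hash value to an ascending list of the keywords sharing it."""
    table: Dict[int, List[str]] = {}
    for kw in keywords:
        chain = table.setdefault(keyword_hash(kw), [])
        i = bisect.bisect_left(chain, kw)
        if i == len(chain) or chain[i] != kw:   # skip exact duplicates
            chain.insert(i, kw)                 # keep the chain sorted
    return table


def lookup(table: Dict[int, List[str]], candidate: str) -> bool:
    """True only if the candidate itself is stored, not merely its hash value."""
    chain = table.get(keyword_hash(candidate), [])
    i = bisect.bisect_left(chain, candidate)
    return i < len(chain) and chain[i] == candidate
```

Under this hash, the strings "Aa" and "qb" collide (both hash to 1617), so they end up in the same chain, while "ac" has the same hash value but is correctly rejected by `lookup`.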
The hash value Hs of the string S is computed in the same way as the keyword hash value described above.
Because the invention triggers keyword matching on end characters, consider the text ABCDE to be matched, and suppose that A, B, C, and D all belong to the first-character set (hash table H2) and that E is an end character. The candidates ABCDE, BCDE, CDE, DE, and E may then all need to be checked. Because the hash is computed from back to front, the hash value of DE can be obtained from the hash value of E and the character D without processing E again, so this hash construction makes computing the candidate hash values efficient.
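The reuse described above can be made concrete with a small sketch. The function name `suffix_hashes` is hypothetical and `ord` is again assumed as the per-character hash. Walking leftwards from the end character, each longer candidate costs one shift and one XOR on top of the previous hash value.

```python
from typing import Dict


def keyword_hash(s: str) -> int:
    h = 0
    for ch in reversed(s):                 # reference version: full recomputation
        h = (h << 4) ^ ord(ch)
    return h


def suffix_hashes(text: str, end: int, max_len: int) -> Dict[int, int]:
    """Hash of text[start:end + 1] for each start, moving left from `end`.

    `end` is a 0-based index; at most `max_len` candidates are produced.
    """
    hashes: Dict[int, int] = {}
    h = 0
    for start in range(end, max(end - max_len, -1), -1):
        h = (h << 4) ^ ord(text[start])    # extend the previous hash by one character
        hashes[start] = h
    return hashes
```

For the text ABCDE with the end character E at index 4, the value stored for index 3 is the hash of DE, obtained as (hash of E shifted left by 4) XOR the code of D, and it equals `keyword_hash("DE")`.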
For a better understanding of the invention, an example follows. The keywords are Chinese city names, so positions and lengths below count Chinese characters.
Let P be the keyword set: P = [北京 (Beijing), 上海 (Shanghai), 厦门 (Xiamen), 呼和浩特 (Hohhot), 海口 (Haikou), 成都 (Chengdu), 宜昌 (Yichang), 盐城 (Yancheng)].
Let T be the text to be matched: T = [Xiamen and Shanghai are both livable tourist cities].
Pattern matching according to the method of the invention proceeds as follows.
First, build the hash tables:
H1 = [309719: 北京; 432506: 上海; 593190: 厦门; 122676668: 呼和浩特; 348999: 海口; 617920: 成都; 408412: 宜昌; 330288: 盐城];
H2 = [21271: 北; 19978: 上; 21414: 厦; 21628: 呼; 28023: 海; 25104: 成; 23452: 宜; 30416: 盐];
H3 = [20140: 京; 28023: 海; 38376: 门; 29305: 特; 21475: 口; 37117: 都; 26124: 昌; 22478: 城];
The longest keyword is 呼和浩特 (Hohhot), so L = 4.
Construct a linear list of length 4, the first-character list PL, to store matched first characters together with their positions.
Next, scan the first character of T, 厦. It belongs to the first-character set (hash table H2) and its position in the text is 1, so it is stored in PL, giving PL = [(厦, 1)].
Scan the next character, 门. It belongs to the last-character set (hash table H3), so the string from 厦 to 门, 厦门 (Xiamen), is extracted. Its length is 2, which is less than L. Its hash value is 593190, and the keyword with that hash value in H1 is 厦门, so the match succeeds. PL is emptied: PL = [].
Scan the next character, [and]. It belongs to neither H2 nor H3, so the matching step is skipped.
Scan the next character, 上. It is the first character of a keyword, so it is stored in PL with its position 4: PL = [(上, 4)].
Scan the next character, 海. It is first found to be the first character of a keyword, so it is stored with its position 5: PL = [(上, 4); (海, 5)]. It also belongs to H3. The string from the most recently stored first character 海 to the current character is 海, whose hash value 28023 matches no keyword in H1. Extraction continues with the previously stored first character 上, giving 上海 (Shanghai), whose hash value is 432506; the keyword with that hash value in H1 is 上海, so the match succeeds. PL is emptied: PL = [].
Scan the next character, 都. It belongs to H3, but PL is empty, so the matching step is skipped.
When the scan reaches 宜, it belongs to H2, so it is stored with its position 8: PL = [(宜, 8)].
When the scan reaches 城, it belongs to H3. The string from the stored first character 宜 to 城 ([livable tourism city]) has length 6, which is greater than L, so the matching step is skipped.
This repeats until the whole text has been scanned.
The invention builds hash tables to eliminate the time wasted by searching for keywords through traversal and comparison; during matching, the hash value of a keyword is used to decide quickly whether a match has succeeded.
The invention hash-codes each keyword in the keyword set and then performs keyword matching on the hash codes, which is more efficient than common multi-pattern keyword matching algorithms. Matching is triggered by end characters, so that during a text scan the method can quickly determine that the current text matches no keyword; a keyword is looked up by hash value only when both its first and last characters agree with the text, which greatly speeds up text scanning and matching.
Claims (2)
1. A fast keyword filtering method, characterized by comprising a keyword indexing method and a keyword matching method, wherein
the keyword indexing method comprises:
Step 10: building a hash table H1, and storing the hash value of each keyword in a preset keyword set in hash table H1;
Step 20: building hash tables H2 and H3, storing the hash value of the first character of each keyword in the keyword set in hash table H2, and storing the hash value of the last character of each keyword in the keyword set in hash table H3;
Step 30: finding the keyword with the most characters in the keyword set, and recording the character length L of that keyword;
and the keyword matching method comprises:
Step 40: constructing a linear list whose length equals the character length L, to obtain the first-character list PL;
Step 50: scanning one character C from the text to be matched, from front to back;
Step 60: determining whether character C is the first character of any keyword in hash table H2; if so, storing character C and its position in the text to be matched in the first-character list PL and performing step 70; if not, performing step 70;
Step 70: determining whether character C is the last character of any keyword in hash table H3; if so, performing step 90; if not, performing step 80;
Step 80: scanning the next character C of the text to be matched and determining whether character C is empty; if so, ending the procedure; if not, performing step 60;
Step 90: taking one character B from the first-character list PL, from back to front, and determining whether character B is empty; if so, performing step 80; if not, performing step 100;
Step 100: computing the difference I between the positions of character B and character C in the text to be matched, and determining whether I ≤ L; if so, performing step 110; if not, emptying the first-character list PL and performing step 80;
Step 110: extracting the string S between character B and character C in the text to be matched, computing the hash value Hs of the string S, and determining whether Hs is the hash value of any keyword in hash table H1; if so, reporting a successful keyword match, emptying the first-character list PL, and returning to step 80; if not, proceeding to step 120;
Step 120: taking the next character B from the first-character list PL, continuing toward the front, and determining whether character B is empty; if so, performing step 80; if not, performing step 100.
2. The fast keyword filtering method according to claim 1, characterized in that the hash value of each keyword in the keyword set in step 10 is computed as follows:
Step 1: setting the hash code HD of the keyword to 0;
Step 2: scanning one character D of the keyword, from back to front;
Step 3: determining whether character D is empty; if so, ending the procedure; if not, proceeding to step 4;
Step 4: shifting HD left by four bits, then XORing it with the hash value of character D, and taking the result as the new value of HD; fetching the next character D and repeating steps 3 and 4 until the last remaining character of the keyword has been processed, the result after that character being the hash value of the keyword.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310092433.XA CN103186669B (en) | 2013-03-21 | 2013-03-21 | Keyword fast filtering method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103186669A CN103186669A (en) | 2013-07-03 |
CN103186669B true CN103186669B (en) | 2018-07-06 |
Family
ID=48677836
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310092433.XA Active CN103186669B (en) | 2013-03-21 | 2013-03-21 | Keyword fast filtering method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103186669B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103885964B (en) * | 2012-12-20 | 2017-06-27 | 北京新媒传信科技有限公司 | A kind of content auditing method and system |
CN104809138B (en) * | 2014-01-28 | 2018-06-08 | 阿里巴巴集团控股有限公司 | A kind of vocabulary management method and equipment based on hash processing |
CN104602206A (en) * | 2014-12-31 | 2015-05-06 | 上海大汉三通通信股份有限公司 | Spam message identification method and system |
CN106484730A (en) * | 2015-08-31 | 2017-03-08 | 北京国双科技有限公司 | Character string matching method and device |
CN105260357B (en) * | 2015-10-14 | 2018-03-30 | 北京京东尚科信息技术有限公司 | Sensitive word inspection method and equipment based on Hash digraph |
CN107402940B (en) * | 2016-05-20 | 2020-02-07 | 北京京东尚科信息技术有限公司 | Sensitive word retrieval method and device and storage medium |
CN109145283B (en) * | 2017-06-17 | 2022-03-15 | 黄冈 | Artificial intelligent sensitive information detection method |
CN109783607B (en) * | 2018-12-19 | 2023-04-25 | 南京莱斯信息技术股份有限公司 | Method for matching and identifying massive keywords in arbitrary text |
CN109933644B (en) * | 2019-03-22 | 2021-03-09 | 中国农业银行股份有限公司 | Character string matching method and device |
CN110222015B (en) * | 2019-06-19 | 2021-07-09 | 北京泰迪熊移动科技有限公司 | File data reading and querying method and device and readable storage medium |
JP7498844B1 (en) | 2023-12-27 | 2024-06-12 | ネットスター株式会社 | Web Filtering Systems |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11250086A (en) * | 1998-03-03 | 1999-09-17 | Hitachi Ltd | Retrieval support system |
CN101030221A (en) * | 2007-04-13 | 2007-09-05 | 清华大学 | Large-scale and multi-key word matching method for text or network content analysis |
CN101359325A (en) * | 2007-08-01 | 2009-02-04 | 北京启明星辰信息技术有限公司 | Multi-key-word matching method for rapidly analyzing content |
CN101398820A (en) * | 2007-09-24 | 2009-04-01 | 北京启明星辰信息技术有限公司 | Large scale key word matching method |
Non-Patent Citations (1)
Title |
---|
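An improved Wu-Manber multi-keyword matching algorithm (一种改进的Wu-Manber 多关键词匹配算法); Zhang Xin (张鑫) et al.; Journal of Computer Applications (《计算机应用》); 2003-07-31; pp. 29-31 * |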
Also Published As
Publication number | Publication date |
---|---|
CN103186669A (en) | 2013-07-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
EXSB | Decision made by sipo to initiate substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||