CN103186669B - Keyword fast filtering method - Google Patents

Info

Publication number
CN103186669B
Authority
CN
China
Prior art keywords
character
keyword
hash
hash table
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310092433.XA
Other languages
Chinese (zh)
Other versions
CN103186669A (en)
Inventor
余小峰
程行荣
林贺辉
时宜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Yaxon Networks Co Ltd
Original Assignee
Xiamen Yaxon Networks Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Yaxon Networks Co Ltd
Priority to CN201310092433.XA
Publication of CN103186669A
Application granted
Publication of CN103186669B
Current legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a fast keyword filtering method comprising a keyword indexing step and a keyword matching step. Each keyword in the keyword set is hash-encoded, and matching is then performed on the hash codes, which is more efficient than common multi-pattern keyword matching algorithms. Matching is triggered by an end marker (the last character of a keyword), so that during scanning the method can quickly determine that the current text matches no keyword; a hash lookup is performed only when both the first and the last character of a candidate agree with those of some keyword, which greatly speeds up text scanning and matching.

Description

Keyword fast filtering method
Technical field
The present invention relates to a fast keyword filtering method.
Background art
In computer science, keyword matching is one of the earliest computer applications. The problem it solves is to determine quickly whether a given block of data contains one or more keywords from a keyword set. Keyword matching is widely used in text processing, network content analysis, intrusion detection, information retrieval, virus detection and other fields. In recent years, with the rapid development of network technology and the sharp growth in the number of users, the volume of data transmitted over networks keeps increasing and online language is becoming ever more diverse. How to screen out users' improper speech quickly and filter illegal vocabulary from user-posted content has become a major problem in the field of keyword matching.
Early network applications gave rise to keyword matching algorithms such as the automaton algorithm and WM (the Wu-Manber algorithm). The automaton algorithm is described here as the most straightforward approach: each keyword is matched against the text in turn, so that in the worst case the text must be scanned and matched N times, where N is the number of keywords. The WM algorithm is another fast and effective multi-pattern matching algorithm, proposed by Sun Wu and Udi Manber in 1994. It combines multi-stage matching with hashing: three tables built in advance from the input pattern set, SHIFT (shift table), HASH (hash table) and PREFIX (prefix table), together with the jumping idea of the BM (Boyer-Moore) algorithm, allow a large number of unnecessary match tests to be avoided. With these two techniques the WM algorithm is very efficient in practice.
The WM algorithm is an exact matching algorithm and is mainly suited to texts in languages such as those of the Arabic and Romance families. In Chinese, the smallest unit of text is the single Chinese character, characters can be combined freely into words, and the vocabulary is very large compared with other languages. This places stricter demands on a keyword matching algorithm, so existing algorithms cannot reach satisfactory efficiency in practice; performance is especially poor for exact matching of multiple keywords.
Summary of the invention
The technical problem to be solved by the present invention is to provide a fast keyword filtering method that overcomes the low efficiency of traditional keyword matching and also solves the problem that existing multi-pattern keyword matching cannot perform efficiently on Chinese text.
The present invention solves the above technical problem with the following technical solution: a fast keyword filtering method comprising a keyword indexing method and a keyword matching method.
The keyword indexing method comprises the following steps:
Step 10: Build a hash table H1 and store in it the hash value of each keyword in a preset keyword set;
Step 20: Build hash tables H2 and H3; store the hash value of the first character of each keyword in the keyword set in H2, and the hash value of the last character of each keyword in H3;
Step 30: Find the keyword with the most characters in the keyword set and record its character length L;
The keyword matching method comprises the following steps:
Step 40: Construct a linear list whose length equals the character length L, giving the first-character list P_L;
Step 50: Scan one character C of the text to be matched, from front to back;
Step 60: Determine whether character C is the first character of any keyword, i.e. whether it is in hash table H2. If so, store character C and its position in the text to be matched in the first-character list P_L, then perform step 70; if not, perform step 70 directly;
Step 70: Determine whether character C is the last character of any keyword, i.e. whether it is in hash table H3. If so, perform step 90; if not, perform step 80;
Step 80: Scan the next character C of the text to be matched and determine whether C is empty. If so, end the procedure; if not, perform step 60;
Step 90: Take one character B from the first-character list P_L, working from back to front (most recently stored first), and determine whether B is empty. If so, perform step 80; if not, perform step 100;
Step 100: Compute the difference I between the positions of character B and character C in the text to be matched and determine whether I ≤ L. If so, perform step 110; if not, empty the first-character list P_L and perform step 80;
Step 110: Extract the string S from character B to character C in the text to be matched, compute its hash value Hs, and determine whether Hs is one of the keyword hash values in hash table H1. If so, report one successful keyword match event, empty the first-character list P_L, and return to step 80; if not, go to step 120;
Step 120: Take the next character B from the first-character list P_L, continuing toward the front, and determine whether B is empty. If so, perform step 80; if not, perform step 100.
Further, the hash value of each keyword in the keyword set in step 10 is computed as follows:
Step 1: Set the hash code H_D of the keyword to 0;
Step 2: Scan one character D of the keyword, from back to front;
Step 3: Determine whether character D is empty. If so, end the procedure; if not, go to step 4;
Step 4: Shift H_D left by four bits, XOR the result with the hash value of character D, and take the outcome as the new value of H_D. Then extract the next character D and repeat steps 3 and 4 until D is the last remaining character of the keyword (its first character, since the scan runs from back to front); the result after processing that character is the hash value of the keyword.
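The four steps above can be sketched in Python as follows. This is an illustrative reading rather than the patent's reference implementation: the function name `keyword_hash` is hypothetical, and taking each character's own hash value to be its Unicode code point is an assumption (it is consistent with the table values given in the worked example later in this description).

```python
def keyword_hash(word: str) -> int:
    """Hash a string by scanning it from its last character to its first."""
    h = 0  # Step 1: the hash code starts at 0
    for ch in reversed(word):  # Step 2: scan from back to front
        # Step 4: shift left by four bits, then XOR with the character's hash.
        # Assumption: a character's hash is its Unicode code point.
        h = (h << 4) ^ ord(ch)
    return h  # Step 3: no characters are left, so the procedure ends


print(keyword_hash("厦门"))  # 593190
```

Python integers do not overflow, so long keywords simply produce larger values; a fixed-width implementation would have to decide how to truncate.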
The beneficial effect of the present invention is that it implements sensitive-word filtering of text content in a way that meets efficiency requirements while also reducing the difficulty of development and implementation.
Detailed description of the embodiments
A fast keyword filtering method comprises a keyword indexing method and a keyword matching method.
The keyword indexing method comprises the following steps:
Step 10: Build a hash table H1 and store in it the hash value of each keyword in a preset keyword set;
Step 20: Build hash tables H2 and H3; store the hash value of the first character of each keyword in the keyword set in H2, and the hash value of the last character of each keyword in H3. When several keywords share the same first character, it is recorded only once; likewise, a last character shared by several keywords is recorded only once.
Step 30: Find the keyword with the most characters in the keyword set and record its character length L;
The keyword matching method comprises the following steps:
Step 40: Construct a linear list whose length equals the character length L, giving the first-character list P_L; when the number of characters stored in P_L exceeds L, the earliest added character is removed.
Step 50: Scan one character C of the text to be matched, from front to back;
Step 60: Determine whether character C is the first character of any keyword, i.e. whether it is in hash table H2. If so, store character C and its position in the text to be matched in the first-character list P_L, then perform step 70; if not, perform step 70 directly;
Step 70: Determine whether character C is the last character of any keyword, i.e. whether it is in hash table H3. If so, perform step 90; if not, perform step 80;
Step 80: Scan the next character C of the text to be matched and determine whether C is empty. If so, end the procedure; if not, perform step 60;
Step 90: Take one character B from the first-character list P_L, working from back to front (most recently stored first), and determine whether B is empty. If so, perform step 80; if not, perform step 100;
Step 100: Compute the difference I between the positions of character B and character C in the text to be matched and determine whether I ≤ L. If so, perform step 110; if not, empty the first-character list P_L and perform step 80;
Step 110: Extract the string S from character B to character C in the text to be matched, compute its hash value Hs, and determine whether Hs is one of the keyword hash values in hash table H1. If so, report one successful keyword match event, empty the first-character list P_L, and return to step 80; if not, go to step 120;
Step 120: Take the next character B from the first-character list P_L, continuing toward the front, and determine whether B is empty. If so, perform step 80; if not, perform step 100.
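The bounded first-character list of step 40 can be sketched with Python's `collections.deque`, whose `maxlen` argument discards the oldest entry automatically once the list is full. This is only one possible data structure; the name `remember_first_char`, the value of `L` and the sample text are hypothetical.

```python
from collections import deque

L = 4  # length of the longest keyword (assumed here for illustration)
first_chars = deque(maxlen=L)  # plays the role of the first-character list


def remember_first_char(ch: str, position: int) -> None:
    """Store a candidate first character with its position in the text."""
    first_chars.append((ch, position))  # the oldest entry is dropped when full


# Pretend every character of this sample text is a keyword first character.
for position, ch in enumerate("ABCDE", start=1):
    remember_first_char(ch, position)

# Entries are visited newest first, as in steps 90 and 120.
print(list(reversed(first_chars)))  # [('E', 5), ('D', 4), ('C', 3), ('B', 2)]
```

Because the deque never holds more than `L` entries, the inner loop of steps 90 to 120 does at most `L` hash lookups per end character.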
The hash value of each keyword in the keyword set is computed as follows:
Step 1: Set the hash code H_D of the keyword to 0;
Step 2: Scan one character D of the keyword, from back to front;
Step 3: Determine whether character D is empty. If so, end the procedure; if not, go to step 4;
Step 4: Shift H_D left by four bits, XOR the result with the hash value of character D, and take the outcome as the new value of H_D. Then extract the next character D and repeat steps 3 and 4 until D is the last remaining character of the keyword (its first character, since the scan runs from back to front); the result after processing that character is the hash value of the keyword.
For the constructed hash table to be well balanced, a hash mapping function F with good uniformity is needed, and F should also be cheap to compute so as to reduce the indexing overhead of the hash table. Hash collisions between keywords are resolved by chaining: keywords with the same hash value are linked into a list whose elements are sorted in ascending dictionary order, and the list is stored in hash table H1 as the value for that hash.
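A minimal sketch of the chaining idea follows, under several assumptions: a Python `dict` of sorted lists stands in for H1, "dictionary order" is taken to be Python's default string ordering, and a candidate is compared with the chained keywords after its hash is found. The names `build_h1` and `contains` are hypothetical, and the ASCII strings were chosen only because they collide under the shift-and-XOR hash.

```python
import bisect
from collections import defaultdict


def keyword_hash(word: str) -> int:
    """Shift-left-by-four and XOR hash, scanning from back to front."""
    h = 0
    for ch in reversed(word):
        h = (h << 4) ^ ord(ch)  # assumption: character hash = code point
    return h


def build_h1(keywords):
    """Map each hash value to an ascending sorted chain of keywords."""
    h1 = defaultdict(list)
    for word in set(keywords):
        bisect.insort(h1[keyword_hash(word)], word)
    return h1


def contains(h1, candidate: str) -> bool:
    """Look up the chain by hash, then binary-search it for the candidate."""
    chain = h1.get(keyword_hash(candidate), [])
    i = bisect.bisect_left(chain, candidate)
    return i < len(chain) and chain[i] == candidate


h1 = build_h1(["qc", "ab"])      # both keywords hash to 1601
print(dict(h1))                  # {1601: ['ab', 'qc']}
print(contains(h1, "qc"))        # True
print(contains(h1, "A`"))        # False: same hash, but not a keyword
```

Keeping each chain sorted allows a binary search within the chain, which only matters when many keywords share one hash value.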
The hash value Hs of string S is computed in the same way as the keyword hash value described above.
Because the present invention uses an algorithm in which an end marker triggers keyword matching, consider a text to be matched ABCDE in which A, B, C and D belong to the first-character set (hash table H2) and E is an end marker: the candidates retrieved are ABCDE, BCDE, CDE, DE and E. If hash values are computed from back to front, the hash of DE can be obtained from the hash of E and the character D without processing E again, so this hash function is efficient for hashing the candidate strings.
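The reuse described for ABCDE can be illustrated with a short sketch: the hash of each longer candidate is obtained from the hash of the next shorter one with a single shift and XOR. The helper name `extend_left` is hypothetical, and the character hash is assumed to be the Unicode code point.

```python
def keyword_hash(word: str) -> int:
    """Reference hash: scan from back to front, shift by four bits, XOR."""
    h = 0
    for ch in reversed(word):
        h = (h << 4) ^ ord(ch)
    return h


def extend_left(suffix_hash: int, ch: str) -> int:
    """Hash of ch + suffix, computed from the hash of the suffix alone."""
    return (suffix_hash << 4) ^ ord(ch)


text = "ABCDE"
suffix_hashes = {}
h = 0
for start in range(len(text) - 1, -1, -1):  # E, DE, CDE, BCDE, ABCDE
    h = extend_left(h, text[start])         # one shift and one XOR per step
    suffix_hashes[text[start:]] = h

# Every incremental value equals the hash computed from scratch.
print(all(value == keyword_hash(s) for s, value in suffix_hashes.items()))  # True
```

With this update, hashing all candidates that end at one end character costs time proportional to the longest candidate, not to the sum of the candidate lengths.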
For a better understanding of the present invention, an example follows.
Let P be the keyword set, P = [北京 (Beijing), 上海 (Shanghai), 厦门 (Xiamen), 呼和浩特 (Hohhot), 海口 (Haikou), 成都 (Chengdu), 宜昌 (Yichang), 盐城 (Yancheng)].
Let T be the text to be matched, T = [Xiamen and Shanghai are both livable tourist cities] (a Chinese sentence; the character positions used below refer to the Chinese text).
The pattern matching process according to the method of the present invention is as follows.
First, build the hash tables:
H1 = [309719: 北京; 432506: 上海; 593190: 厦门; 122676668: 呼和浩特; 348999: 海口; 617920: 成都; 408412: 宜昌; 330288: 盐城];
H2 = [21271: 北; 19978: 上; 21414: 厦; 21628: 呼; 28023: 海; 25104: 成; 23452: 宜; 30416: 盐];
H3 = [20140: 京; 28023: 海; 38376: 门; 29305: 特; 21475: 口; 37117: 都; 26124: 昌; 22478: 城];
The longest keyword is [呼和浩特], so L = 4.
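The tables above can be reproduced with a few lines of Python, which also serves as a check on the listed values. The sketch rests on assumptions: the keywords are written in Chinese characters recovered from the character codes listed in H2 and H3, a character's hash is taken to be its Unicode code point, and collisions in H1 are ignored. The name `build_index` is hypothetical.

```python
def keyword_hash(word: str) -> int:
    """Shift-left-by-four and XOR hash, scanning from back to front."""
    h = 0
    for ch in reversed(word):
        h = (h << 4) ^ ord(ch)
    return h


def build_index(keywords):
    """Return (H1, H2, H3, L) for a keyword set, following steps 10 to 30."""
    h1 = {keyword_hash(k): k for k in keywords}  # step 10, no collision handling
    h2 = {ord(k[0]) for k in keywords}           # step 20: first-character codes
    h3 = {ord(k[-1]) for k in keywords}          # step 20: last-character codes
    longest = max(len(k) for k in keywords)      # step 30
    return h1, h2, h3, longest


# Assumed Chinese spellings of the eight city names in the example.
keywords = ["北京", "上海", "厦门", "呼和浩特", "海口", "成都", "宜昌", "盐城"]
h1, h2, h3, longest = build_index(keywords)

print(h1[593190])                # 厦门
print(keyword_hash("呼和浩特"))   # 122676668
print(longest)                   # 4
```

Using sets for H2 and H3 records each shared first or last character only once, which matches the remark in step 20 about not repeating identical characters.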
Construct a linear list of length 4, the first-character list P_L, to store the first characters that have been matched.
Then scan the first character [厦] of the text T. [厦] belongs to the first-character set (hash table H2) and its position in the text is 1, so it is stored in P_L, giving P_L = [(厦, 1)];
Scan the next character [门]. [门] belongs to the last-character set (hash table H3). Extract the string [厦门] from [厦] to [门]; its length is 2, which is less than L. The hash value of [厦门] is 593190, and the keyword with the same hash value in H1 is [厦门], so the match succeeds. Empty P_L: P_L = [].
Scan the next character [and]. It belongs to neither hash table H2 nor hash table H3, so the matching step is skipped.
Scan the next character [上]. [上] is the first character of a keyword, so it and its position 4 are stored in P_L: P_L = [(上, 4)].
Scan the next character [海]. First, [海] is the first character of a keyword, so it and its position 5 are stored in P_L: P_L = [(上, 4); (海, 5)]. [海] also belongs to hash table H3. Extract the string [海] from the most recently stored first character [海] to the current character; its hash value is 28023, which matches no keyword in H1. Continue with the previously stored first character [上] and extract the string [上海]; its hash value is 432506, and the keyword with the same hash value in H1 is [上海], so the match succeeds. Empty P_L: P_L = [].
Scan the next character [都]. [都] belongs to hash table H3, but P_L is empty, so the matching step is skipped.
When the scan reaches [宜], [宜] belongs to hash table H2, so it and its position 8 are stored in P_L: P_L = [(宜, 8)].
Continuing, when the scan reaches [城], [城] belongs to hash table H3. The string from the stored first character [宜] to [城] ("livable tourist city") has length 6, which is greater than L, so the matching step is skipped.
The process repeats in this way until the whole text to be matched has been scanned.
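The walk-through above can be reproduced with the following sketch of steps 10 to 120. It is an interpretation, not the patent's code, and it makes several assumptions that should be read as such: the function name `find_keywords` is hypothetical; the character hash is the Unicode code point; the length test in step 100 uses the length of the candidate string, as the example does; the candidate is also compared with the stored keyword to rule out hash collisions; and the Chinese sentence is a plausible reconstruction of T, not text taken from the source.

```python
from collections import deque


def keyword_hash(word: str) -> int:
    """Shift-left-by-four and XOR hash, scanning from back to front."""
    h = 0
    for ch in reversed(word):
        h = (h << 4) ^ ord(ch)
    return h


def find_keywords(text: str, keywords: list[str]) -> list[tuple[int, str]]:
    """Return (1-based start position, keyword) for each reported match."""
    h1 = {keyword_hash(k): k for k in keywords}  # step 10 (collisions ignored)
    firsts = {k[0] for k in keywords}            # step 20, role of H2
    lasts = {k[-1] for k in keywords}            # step 20, role of H3
    max_len = max(len(k) for k in keywords)      # step 30
    pending = deque(maxlen=max_len)              # step 40, first-character list
    matches = []
    for pos, ch in enumerate(text, start=1):     # steps 50 and 80
        if ch in firsts:                         # step 60
            pending.append(pos)
        if ch not in lasts:                      # step 70
            continue
        for start in reversed(list(pending)):    # steps 90 and 120: newest first
            if pos - start + 1 > max_len:        # step 100, using string length
                pending.clear()
                break
            candidate = text[start - 1:pos]      # step 110
            # The string comparison is an addition of this sketch.
            if h1.get(keyword_hash(candidate)) == candidate:
                matches.append((start, candidate))
                pending.clear()
                break
    return matches


keywords = ["北京", "上海", "厦门", "呼和浩特", "海口", "成都", "宜昌", "盐城"]
text = "厦门和上海都是宜居的旅游城市"  # assumed reconstruction of T
print(find_keywords(text, keywords))  # [(1, '厦门'), (4, '上海')]
```

As in the walk-through, the single character 海 is tried first and rejected, 都 is ignored because no first character is pending, and the candidate starting at 宜 is dropped because it is longer than L.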
The present invention builds hash tables to eliminate the time wasted by finding keywords through traversal and one-by-one comparison; during search and matching, it determines quickly from a keyword's hash value whether a match has succeeded.
The present invention hash-encodes each keyword in the keyword set and then performs keyword matching on the hash codes, which is more efficient than common multi-pattern keyword matching algorithms. Matching is triggered by an end marker, so that during scanning the method can quickly determine that the current text matches no keyword; a hash lookup is performed only when both the first and the last character of a candidate agree with those of some keyword, which greatly speeds up text scanning and matching.

Claims (2)

1. A fast keyword filtering method, characterized in that it comprises a keyword indexing method and a keyword matching method,
the keyword indexing method comprising the following steps:
Step 10: Build a hash table H1 and store in it the hash value of each keyword in a preset keyword set;
Step 20: Build hash tables H2 and H3; store the hash value of the first character of each keyword in the keyword set in H2, and the hash value of the last character of each keyword in H3;
Step 30: Find the keyword with the most characters in the keyword set and record its character length L;
the keyword matching method comprising the following steps:
Step 40: Construct a linear list whose length equals the character length L, giving the first-character list P_L;
Step 50: Scan one character C of the text to be matched, from front to back;
Step 60: Determine whether character C is the first character of any keyword in hash table H2; if so, store character C and its position in the text to be matched in the first-character list P_L, then perform step 70; if not, perform step 70;
Step 70: Determine whether character C is the last character of any keyword in hash table H3; if so, perform step 90; if not, perform step 80;
Step 80: Scan the next character C of the text to be matched and determine whether C is empty; if so, end the procedure; if not, perform step 60;
Step 90: Take one character B from the first-character list P_L, from back to front, and determine whether B is empty; if so, perform step 80; if not, perform step 100;
Step 100: Compute the difference I between the positions of character B and character C in the text to be matched and determine whether I ≤ L; if so, perform step 110; if not, empty the first-character list P_L and perform step 80;
Step 110: Extract the string S from character B to character C in the text to be matched, compute the hash value Hs of the string S, and determine whether Hs is one of the keyword hash values in hash table H1; if so, report one successful keyword match event, empty the first-character list P_L, and return to step 80; if not, go to step 120;
Step 120: Take the next character B from the first-character list P_L, continuing toward the front, and determine whether B is empty; if so, perform step 80; if not, perform step 100.
2. The fast keyword filtering method according to claim 1, characterized in that the hash value of each keyword in the keyword set in step 10 is computed as follows:
Step 1: Set the hash code H_D of the keyword to 0;
Step 2: Scan one character D of the keyword, from back to front;
Step 3: Determine whether character D is empty; if so, end the procedure; if not, go to step 4;
Step 4: Shift H_D left by four bits, XOR the result with the hash value of character D, and take the outcome as the new value of H_D; then extract the next character D and repeat steps 3 and 4 until D is the last remaining character of the keyword; the result after processing that character is the hash value of the keyword.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310092433.XA CN103186669B (en) 2013-03-21 2013-03-21 Keyword fast filtering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310092433.XA CN103186669B (en) 2013-03-21 2013-03-21 Keyword fast filtering method

Publications (2)

Publication Number Publication Date
CN103186669A CN103186669A (en) 2013-07-03
CN103186669B true CN103186669B (en) 2018-07-06

Family

ID=48677836

Family Applications (1)

Application Number Status Publication Priority Date Filing Date Title
CN201310092433.XA Active CN103186669B (en) 2013-03-21 2013-03-21 Keyword fast filtering method

Country Status (1)

Country Link
CN (1) CN103186669B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103885964B * 2012-12-20 2017-06-27 北京新媒传信科技有限公司 Content auditing method and system
CN104809138B * 2014-01-28 2018-06-08 阿里巴巴集团控股有限公司 Vocabulary management method and device based on hash processing
CN104602206A (en) * 2014-12-31 2015-05-06 上海大汉三通通信股份有限公司 Spam message identification method and system
CN106484730A (en) * 2015-08-31 2017-03-08 北京国双科技有限公司 Character string matching method and device
CN105260357B (en) * 2015-10-14 2018-03-30 北京京东尚科信息技术有限公司 Sensitive word inspection method and equipment based on Hash digraph
CN107402940B (en) * 2016-05-20 2020-02-07 北京京东尚科信息技术有限公司 Sensitive word retrieval method and device and storage medium
CN109145283B (en) * 2017-06-17 2022-03-15 黄冈 Artificial intelligent sensitive information detection method
CN109783607B (en) * 2018-12-19 2023-04-25 南京莱斯信息技术股份有限公司 Method for matching and identifying massive keywords in arbitrary text
CN109933644B (en) * 2019-03-22 2021-03-09 中国农业银行股份有限公司 Character string matching method and device
CN110222015B (en) * 2019-06-19 2021-07-09 北京泰迪熊移动科技有限公司 File data reading and querying method and device and readable storage medium
JP7498844B1 (en) 2023-12-27 2024-06-12 ネットスター株式会社 Web Filtering Systems

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11250086A (en) * 1998-03-03 1999-09-17 Hitachi Ltd Retrieval support system
CN101030221A (en) * 2007-04-13 2007-09-05 清华大学 Large-scale and multi-key word matching method for text or network content analysis
CN101359325A (en) * 2007-08-01 2009-02-04 北京启明星辰信息技术有限公司 Multi-key-word matching method for rapidly analyzing content
CN101398820A (en) * 2007-09-24 2009-04-01 北京启明星辰信息技术有限公司 Large scale key word matching method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
一种改进的Wu-Manber 多关键词匹配算法;张鑫等;《计算机应用》;20030731;第29-31页 *

Also Published As

Publication number Publication date
CN103186669A (en) 2013-07-03

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant