CN101030221A - Large-scale and multi-key word matching method for text or network content analysis - Google Patents

Info

Publication number
CN101030221A
Authority
CN
China
Prior art keywords
window
keyword
list
text
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200710065392
Other languages
Chinese (zh)
Other versions
CN100452055C (en)
Inventor
周宗伟
薛一波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CNB200710065392XA
Publication of CN101030221A
Application granted
Publication of CN100452055C
Expired - Fee Related
Anticipated expiration

Abstract

A large-scale multi-keyword matching method for analyzing text or network content. The method builds a jump table and a keyword table, computes the jump value of each jump table entry, and associates keywords with keyword table entries. During matching, a hash operation is applied to the data block in the window and the result is used to index the jump table. If the jump value is non-zero, the window is moved by that value. Otherwise a second hash operation is applied to the data block, the result is used to index the keyword table, and the keywords associated with that entry are compared in turn with the corresponding field of the text to determine whether they match.

Description

A large-scale multi-keyword matching method for text or network content analysis
Technical field
The present invention relates to a large-scale multi-keyword matching method for text or network content analysis, and in particular to techniques for processing text or network content against a large-scale keyword set. It belongs to the field of computer data processing.
Background art
Multi-keyword matching is one of the basic problems of computer science. The problem is to decide, quickly and accurately, whether a given text or data block contains one or more keywords from a given keyword set. Multi-keyword matching is now widely used across network security, for example in firewalls, virus detection, intrusion detection and prevention, and content filtering. It also extends to other fields, such as information management systems, web search engines, and gene sequence detection in bioinformatics. Research on and improvement of multi-keyword matching methods is therefore of real practical significance.
One of the classical approaches to the multi-keyword matching problem is based on a hashed jump table. It was first proposed in 1994 by Sun Wu of National Chung Cheng University, Taiwan, and Udi Manber of the University of Arizona, USA, and is usually named after its inventors; it is referred to below as the "WM method". The basic idea is to compute, during matching, the jump value corresponding to a data block of B bytes in the text (B is usually 2 or 3) and, depending on that value, either move along the text or go on to verify the keywords that may match. The WM method determines the jump value of a data block by a heuristic rule, which avoids some unnecessary verification, and it builds the jump value table by hashing, which speeds up matching.
Like multi-keyword matching methods in general, the WM method has two stages: preprocessing and matching. In the preprocessing stage it builds three tables: a hash table, a jump table and a prefix table. The jump table stores the jump value of each data block and determines how many characters of the text can be skipped during scanning. When the jump table value is 0, the hash table and the prefix table are used to decide whether a match with the text is possible and which keywords of the set may be involved. For the keyword set {her, his, she, kiss} and block size B=2, the corresponding jump table, hash table and prefix table are shown in Fig. 1. During matching, a window of length B is placed at the beginning of the text. The WM method computes the hash value of the B characters in the current window and looks up the corresponding jump table entry. If the jump value in the entry is non-zero, the window is moved. If it is zero, the hash table is looked up with the same hash value to find the keywords that may match, and the prefix table is used to filter out some of them. The keywords that cannot be filtered out are then compared character by character to decide whether a match occurs at this position.
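As a rough illustration of the jump table described above, the following sketch builds a WM-style table for the example set {her, his, she, kiss} with B=2. It assumes the usual WM rule (only the first m characters of each keyword are considered, and a block starting at offset j gets the value m - B - j, with m - B + 1 as the default). A plain dict stands in for the hashed table, and the function name is hypothetical. The values it produces are not claimed to reproduce Fig. 1.

```python
def build_wm_shift_table(keywords, B=2):
    """Build a WM-style jump (shift) table; a dict replaces the hashed array."""
    m = min(len(k) for k in keywords)      # length of the shortest keyword
    default_shift = m - B + 1              # value for blocks seen in no keyword
    shift = {}
    for kw in keywords:
        prefix = kw[:m]                    # WM only looks at the first m characters
        for j in range(m - B + 1):
            block = prefix[j:j + B]
            value = m - B - j              # distance from this block to the prefix end
            shift[block] = min(shift.get(block, default_shift), value)
    return m, default_shift, shift


m, default_shift, shift = build_wm_shift_table(["her", "his", "she", "kiss"])
print(m, default_shift)
print(sorted(shift.items()))
```

Blocks that end a keyword prefix ("er", "is", "he") receive the value 0 and force a verification step; every block missing from the dict keeps the default value.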
The WM method has excellent average performance: it matches quickly and uses little memory. It is among the best-performing methods in the multi-keyword matching field in both time and space, and it is therefore widely used, for example in the Linux command agrep and in the keyword matching module of the well-known open-source intrusion detection system Snort.
It should be noted that, with the rapid development of Internet technology, research on multi-keyword matching faces new challenges. Attacks are growing more complex, and the number of keywords used in network security applications keeps growing with them. According to CNCERT/CC, the rule sets currently used for filtering network content and sensitive information generally contain tens of thousands of entries, and some reach hundreds of thousands. The virus signature database of the well-known open-source antivirus software ClamAV has reached 49,644 entries and is still growing. In view of this, the present invention defines a large-scale keyword set as a set containing 10,000 or more keywords. Meanwhile, as network bandwidth keeps increasing, the processing capacity of network security systems must keep improving, which means that multi-keyword matching, as a core technique, must also keep getting faster. In fact, the performance of most network security systems, especially real-time systems such as virus detection, traffic statistics and analysis, and sensitive-information filtering, cannot yet meet application demands. These two facts show the need for research on high-speed multi-keyword matching methods for large-scale keyword sets.
However, analysis of the WM algorithm shows that as the number of keywords grows, the number of jump table entries whose jump value is zero increases and the average jump value decreases. A large number of unnecessary character comparisons can then no longer be avoided during matching, and the matching speed for text or network content drops. Experimental results show that when the number of keywords reaches 100,000, the matching speed of the WM method drops markedly, and its very low throughput can hardly meet the practical requirements of text or network content processing.
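The effect described above can be observed on synthetic data. The sketch below is not the experiment referred to in the text: it assumes random 8-letter lowercase keywords, B=2, a dict in place of the hashed table, and an average taken uniformly over all 26^2 possible blocks. All names are hypothetical.

```python
import random
import string


def wm_shift_stats(keywords, B=2, alphabet_size=26):
    """Return (zero_entries, average_shift) for a WM-style jump table."""
    m = min(len(k) for k in keywords)
    default_shift = m - B + 1
    shift = {}
    for kw in keywords:
        for j in range(m - B + 1):
            block = kw[j:j + B]
            shift[block] = min(shift.get(block, default_shift), m - B - j)
    zero_entries = sum(1 for v in shift.values() if v == 0)
    total_blocks = alphabet_size ** B
    unseen = total_blocks - len(shift)     # blocks that keep the default value
    average = (sum(shift.values()) + unseen * default_shift) / total_blocks
    return zero_entries, average


rng = random.Random(0)                     # fixed seed, synthetic keywords only
pool = ["".join(rng.choice(string.ascii_lowercase) for _ in range(8))
        for _ in range(2000)]
results = [(n, *wm_shift_stats(pool[:n])) for n in (10, 100, 1000, 2000)]
for n, zeros, avg in results:
    print(f"{n:5d} keywords: {zeros:3d} zero entries, average shift {avg:.2f}")
```

Because each larger set contains the smaller ones, the zero-entry count can only grow and the average jump value can only shrink as keywords are added, which is the trend the paragraph describes.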
Summary of the invention
The object of the invention is to provide a large-scale multi-keyword matching method for text or network content analysis. The method introduces, for the first time, the idea of dynamically cutting keywords. It fully exploits the heuristic information in the keyword set, overcomes the WM method's drawback of having too many zero-valued jump table entries under a large-scale keyword set, and significantly improves matching speed for such sets. The method also introduces a two-stage compressed hash table and, while preserving matching performance, removes the prefix table used in the WM method. This reduces memory consumption, ensures good hardware scalability, and increases practical value.
The large-scale multi-keyword matching method for text or network content analysis comprises the following steps:
(1) Determine the length m of the shortest keyword in the keyword set, where m is a positive integer greater than or equal to 4, and build a jump table and a keyword table;
(2) Place a window of size m at the beginning of the first keyword of the keyword set and apply a first hash operation to the data block formed by the last B characters in the window, where B is 3 or 4. Look up the jump table with the resulting hash value. If the jump value of the corresponding entry is non-zero, move the window by one character and repeat step (2); if it is zero, go to step (3);
(3) Apply a second hash operation to the data block formed by the last B characters in the window and look up the keyword table with the resulting hash value. If the keyword count of the corresponding entry is non-zero, move the window by one character and repeat step (2). If it is zero, the current window is the optimal m-window; record the offset P between the optimal m-window and the beginning of the keyword, where P equals the number of times the window has been moved. If the window reaches the right end of the keyword, designate the first m characters of the keyword as the optimal m-window and record an offset of zero;
(4) Repeat steps (2) and (3) to obtain the optimal m-window and the corresponding offset for each keyword;
(5) Using the optimal m-windows of all keywords in the keyword set and all data blocks of length B in each window, update the jump values in the jump table and the keyword table and associate each keyword with a keyword table entry, which yields the analysis jump table and the analysis keyword table;
(6) Place a window of size B at the beginning of the text or network content to be analyzed and apply a third hash operation to the data in the window. Look up the analysis jump table with the resulting hash value. If the jump value of the corresponding entry is non-zero, move the window by the jump value. If the jump value is zero, apply a fourth hash operation to the data block and look up the analysis keyword table with the resulting hash value. If one or more keywords that may match are associated with the corresponding entry, compare each keyword in turn, character by character, with the corresponding field of the text or network content to be analyzed, then move the window by the jump value in this keyword table entry. Continue until the window reaches the end of the text or network content to be analyzed.
In the above method, the process of comparing each keyword in turn, character by character, with the corresponding field of the text or network content to be analyzed comprises the following steps:
(1) Let T be the distance between the first character of the current B window and the beginning of the text or network content to be analyzed. The field to be compared then begins at distance T - L from the beginning of the text or network content, where L = m - B + P and P is the offset of the optimal m-window of the first keyword;
(2) Align the first character of the first keyword with the beginning of the field to be compared and compare character by character. If they match, output the match position T - L and the keyword number of this keyword, then go to step (3); if they do not match, go directly to step (3);
(3) Repeat steps (1) and (2) until all associated keywords have been compared.
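The alignment in step (1) can be checked with a small sketch. It takes L = m - B + P, the form given in the description of the matching stage, and uses the "cooperation" example with m = 8, B = 4 and optimal m-window "peration", hence P = 3. The function name and the sample text are hypothetical.

```python
def verify_at(text, keyword, T, m, B, P):
    """Compare `keyword` with the text field that starts at T - L.

    T is the position of the B window in the text and P is the offset of
    the keyword's optimal m-window; returns the match position or None.
    """
    L = m - B + P                      # distance from keyword start to the B block
    start = T - L
    if start < 0:
        return None                    # the field would begin before the text
    if text[start:start + len(keyword)] == keyword:
        return start
    return None


text = "we value cooperation"
T = text.find("tion")                  # B window sits on the last block of "peration"
print(T, verify_at(text, "cooperation", T, m=8, B=4, P=3))  # prints: 16 9
```

With these values L = 8 - 4 + 3 = 7, so the field to compare begins at 16 - 7 = 9, which is where "cooperation" starts in the sample text.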
The large-scale multi-keyword matching method for text or network content analysis proposed by the invention has the following advantages:
1. Excellent measured performance: in experiments, the proposed high-speed large-scale multi-keyword matching method outperforms the existing WM method in both time and space. With 10,000 to 100,000 keywords, it matches 2 to 4 times faster than the WM method. Its memory use is roughly the same as that of the WM method with B=2 and 75% to 90% lower than that of the WM method with B=3.
2. Meets practical requirements: in practical use, with 100,000 keywords, the proposed method reaches a throughput of 600 Mbps (that is, it matches 75 MB of text or network content per second) while using only about 5 MB of memory, which satisfies the performance requirements of most current text or network content processing.
3. Good hardware scalability: the low memory consumption means that the proposed method could be loaded into a hardware acceleration system or developed into a dedicated system-on-chip (SoC), further improving matching speed, and gives it good market prospects.
Description of drawings
Fig. 1 shows the jump table, hash table and prefix table of the existing WM method for the example keyword set {her, his, she, kiss}.
Fig. 2 shows the jump table and keyword table of the method of the invention before the optimal m-window position of the keyword "cooperation" is set.
Fig. 3 is a flow chart of setting the optimal m-window position of the keyword "cooperation".
Fig. 4 shows the jump table and keyword table after the optimal m-window position of the keyword "cooperation" has been set.
Fig. 5 shows the jump table and keyword table before the jump values are set for the data blocks of length B in the optimal m-window of the keyword "cooperation".
Fig. 6 is a flow chart of setting the jump values for the data blocks of length B in the optimal m-window of the keyword "cooperation".
Fig. 7 shows the jump table and keyword table after the jump values have been set for the data blocks of length B in the optimal m-window of the keyword "cooperation".
Embodiment
In the large-scale multi-keyword matching method for text or network content analysis, the length m of the shortest keyword in the keyword set is determined first, where m is a positive integer greater than or equal to 4, and a jump table and a keyword table are built. A window of size m is placed at the beginning of the first keyword of the keyword set, and a first hash operation is applied to the data block formed by the last B characters in the window, where B is 3 or 4. The jump table is looked up with the resulting hash value. If the jump value of the corresponding entry is non-zero, the window is moved by one character and this step is repeated. If it is zero, a second hash operation is applied to the data block formed by the last B characters in the window and the keyword table is looked up with the resulting hash value. If the keyword count of the corresponding entry is non-zero, the window is moved by one character and the above steps are repeated. If it is zero, the current window is the optimal m-window, and the offset P between the optimal m-window and the beginning of the keyword is recorded; P equals the number of times the window has been moved. If the window reaches the right end of the keyword, the first m characters of the keyword are designated as the optimal m-window and an offset of zero is recorded. These steps are repeated to obtain the optimal m-window and the corresponding offset for each keyword. Using the optimal m-windows of all keywords in the keyword set and all data blocks of length B in each window, the jump values in the jump table and the keyword table are updated and each keyword is associated with a keyword table entry, which yields the analysis jump table and the analysis keyword table. A window of size B is then placed at the beginning of the text or network content to be analyzed, and a third hash operation is applied to the data in the window. The analysis jump table is looked up with the resulting hash value. If the jump value of the corresponding entry is non-zero, the window is moved by the jump value. If the jump value is zero, a fourth hash operation is applied to the data block and the analysis keyword table is looked up with the resulting hash value. If one or more keywords that may match are associated with the corresponding entry, each keyword is compared in turn, character by character, with the corresponding field of the text or network content to be analyzed, and the window is then moved by the jump value in this keyword table entry. This continues until the window reaches the end of the text or network content to be analyzed.
In the above method, each keyword is compared in turn, character by character, with the corresponding field of the text or network content to be analyzed as follows. Let T be the distance between the first character of the current B window and the beginning of the text or network content to be analyzed. The field to be compared then begins at distance T - L from the beginning of the text or network content, where L = m - B + P and P is the offset of the optimal m-window of the first keyword. The first character of the first keyword is aligned with the beginning of the field to be compared and the two are compared character by character. If they match, the match position T - L and the keyword number of this keyword are output. Whether or not they match, the same steps are then repeated for the next keyword until all associated keywords have been compared.
Specifically, compared with the existing WM method, the method of the invention makes three major improvements in key techniques:
(1) The invention uses a two-stage compressed hash table to optimize the jump table of the WM method. First, 20 bits are taken from the data block of B bytes (B may be 3 or 4) to build the first compressed hash table (also called the jump table below; unless otherwise stated, "jump table" refers to the jump table of the present invention). Then, for data blocks whose jump value in the jump table is zero, a second hash function is introduced that takes a further 17 bits from the data block to build the second compressed hash table, called the keyword table. Unlike the hash table of the WM method, each entry of this table holds not only the keywords that may match but also a jump value. The reason is that hash collisions are relatively likely in the jump table of this method: many data blocks with a positive jump value are mapped by the hash operation to entries whose jump value is zero, and the second hash of the keyword table can separate these blocks from the blocks whose jump value really is zero. Under a large-scale keyword set, the two-stage compressed hash table effectively reduces the number of zero-valued entries and increases the average jump value during matching of text or network content, which speeds up multi-keyword matching. The compression and the two-stage design also reduce memory consumption.
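The text does not say which bits the two hash functions select. Purely as an illustration, the sketch below assumes that both keep the last two bytes of the block, that the first hash adds the low 4 bits of the third-to-last byte (20 bits in total), and that the second adds one different bit of that byte (17 bits in total). The function names and bit choices are hypothetical.

```python
JUMP_BITS = 20       # index width of the first compressed hash table (jump table)
KEYWORD_BITS = 17    # index width of the second compressed hash table (keyword table)


def hash1(block: bytes) -> int:
    """Assumed first hash: low 4 bits of block[-3] plus the last two bytes."""
    return ((block[-3] & 0x0F) << 16) | (block[-2] << 8) | block[-1]


def hash2(block: bytes) -> int:
    """Assumed second hash: bit 4 of block[-3] plus the last two bytes."""
    return (((block[-3] >> 4) & 1) << 16) | (block[-2] << 8) | block[-1]


a, b = b"cyon", b"tion"
# 'y' (0x79) and 'i' (0x69) share their low 4 bits but differ in bit 4,
# so the two blocks collide in the jump table and separate in the keyword table.
print(hash1(a) == hash1(b), hash2(a) == hash2(b))  # prints: True False
print(1 << JUMP_BITS, 1 << KEYWORD_BITS)           # table sizes in entries
```

Under these assumptions a block that wrongly lands on a zero-valued jump table entry can still receive its own jump value from the keyword table, which is the separation the paragraph describes.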
(2) The dynamic cutting method improves matching speed by exploiting the information contained in the keyword set. When the shortest keyword in the set has length m, the original WM method simply considers the first m characters of each keyword and does not preprocess the remaining characters, so it may miss information in the keyword set that would help matching speed. Based on this observation, the invention preprocesses the keyword set with a dynamic cutting approach: it examines every data block of length m in each keyword and uses heuristic rules to select the best combination of these blocks (one block per keyword). After study and analysis, the method adopts two heuristic rules: first, the number of jump table entries whose jump value is zero should be as small as possible; second, the average number of keywords associated with each keyword table entry should be as small as possible.
(3) The invention removes the prefix table of the WM method without affecting matching speed, which further reduces memory consumption. Because of the second heuristic rule of the dynamic cutting approach, the average number of keywords associated with each keyword table entry decreases, so fewer keywords have to be confirmed by character comparison each time a keyword table entry is hit. Weighing its contribution to matching speed against the space it requires, the prefix table can be removed in this method.
The invention is described in detail below with reference to the drawings. The method comprises two stages: preprocessing of the keyword set, and matching of text or network content.
In the preprocessing stage, the whole keyword set is scanned first to determine the length m of the shortest keyword, where m is greater than or equal to 4, and the data block size B is set, where B may be 3 or 4. The jump values of all entries in the jump table and the keyword table are initialized to m - B + 1, and the keyword counts of all entries in the keyword table are set to zero.
The keywords are then cut dynamically according to the two heuristic rules above: for each keyword in turn, the method determines the position of the block of m consecutive characters that is best for matching speed (called the optimal m-window). Setting the optimal m-window position of the keyword "cooperation" is used here as an example to illustrate the dynamic cutting procedure. Fig. 2 shows the jump table and keyword table before the setting. The minimum keyword length m of the keyword set is 8 and the data block size B is 4. Note that "~ti" in the jump table denotes a 20-bit value in which the last two bytes (16 bits) are "ti" and "~" stands for 4 further bits; that is, any data block of length B ending in these 20 bits yields, under the first hash operation, a hash value equal to the index of this jump table entry. Likewise, "?ti" in the keyword table denotes that any data block of length B ending in these 17 bits yields, under the second hash operation, a hash value equal to the index of this keyword table entry. Fig. 3 shows the procedure for determining the optimal m-window of "cooperation". A window of length m starts at the left end of the keyword. The first hash operation is applied to the data block formed by the last B characters in the window and the jump table is looked up with the resulting hash value. If the jump value of the corresponding jump table entry is non-zero, the window is moved right by one byte and the procedure is repeated. If it is zero, the second hash operation is applied to the data block and the keyword table is looked up with the resulting hash value. If the keyword count of the corresponding keyword table entry is zero, the current window is the optimal m-window, and the offset P between the current window and the beginning of the keyword is recorded. If the keyword count is non-zero, the window is moved right by one byte and the procedure is repeated. If the window reaches the end of the keyword and no optimal m-window has been found under the conditions above, the first m characters of the keyword are designated as the optimal m-window and the offset P is recorded as zero. The offset P is used in the matching stage to compute the adjustment L that locates the beginning of the corresponding field in the text. Fig. 4 shows the jump table and keyword table after the optimal m-window of the keyword "cooperation" has been determined. As can be seen, once the optimal m-window is determined, the jump values in the jump table entry and the keyword table entry that correspond to the data block formed by the last B characters of the window are set to zero, and the keyword is associated with the corresponding keyword table entry.
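The selection procedure above can be sketched as follows. The two hash functions are assumptions (the text does not specify which bits are taken), dicts stand in for the two tables, and the keyword set is a made-up pair with m = 7, chosen so that a jump table collision actually occurs; it therefore does not reproduce the m = 8 situation of the figures. All names are hypothetical.

```python
def hash1(block):   # assumed 20-bit hash: low 4 bits of block[-3] + last two bytes
    return ((block[-3] & 0x0F) << 16) | (block[-2] << 8) | block[-1]


def hash2(block):   # assumed 17-bit hash: bit 4 of block[-3] + last two bytes
    return (((block[-3] >> 4) & 1) << 16) | (block[-2] << 8) | block[-1]


def choose_optimal_windows(keywords, B=4):
    """Pick an optimal m-window offset P for every keyword, in order."""
    m = min(len(k) for k in keywords)
    default = m - B + 1              # initial jump value of every entry
    jump = {}                        # hash1 -> jump value (missing = default)
    kw_jump = {}                     # hash2 -> jump value in the keyword table
    kw_lists = {}                    # hash2 -> indices of associated keywords
    offsets = []
    for idx, kw in enumerate(keywords):
        P = 0                        # fallback: the first m characters
        for p in range(len(kw) - m + 1):
            block = kw[p + m - B:p + m]          # last B characters of the window
            if jump.get(hash1(block), default) != 0:
                continue             # non-zero jump value: move the window
            if kw_lists.get(hash2(block)):
                continue             # entry already holds keywords: move the window
            P = p
            break
        last = kw[P + m - B:P + m]
        jump[hash1(last)] = 0        # zero both entries and attach the keyword
        kw_jump[hash2(last)] = 0
        kw_lists.setdefault(hash2(last), []).append(idx)
        offsets.append(P)
    return m, offsets, jump, kw_jump, kw_lists


keywords = [b"halcyon", b"cooperation"]
m, offsets, jump, kw_jump, kw_lists = choose_optimal_windows(keywords)
print(m, offsets)                                  # prints: 7 [0, 4]
print(keywords[1][offsets[1]:offsets[1] + m])      # prints: b'eration'
```

The first keyword falls back to P = 0 because no jump table entry is zero yet. For the second, the block "tion" reuses the zero-valued jump table entry created by "cyon" while landing in an empty keyword table entry, so no new zero entry is needed and the keywords stay spread across the keyword table.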
Once the optimal m-windows of all keywords have been determined, every other data block of length B inside the optimal m-window of each keyword must also be processed in turn, and its jump value in the corresponding jump table and keyword table entries updated. Fig. 5 shows the jump table and keyword table before the jump values of the other data blocks in the optimal m-window of the keyword "cooperation" are set, and Fig. 6 shows the procedure for setting them. In these figures "peration" is underlined to mark it as the optimal m-window of this keyword. The method moves a data block window of size B from the left end of the optimal m-window, applies the first hash operation to the data in the B window, looks up the jump table with the resulting hash value, and updates the jump value of the corresponding jump table entry. The jump value is computed as follows: if j is the offset between this B-byte data block and the beginning of the optimal m-window, the jump value of the data block is m - B - j, and the jump table entry takes the smaller of this value and the value already stored in the entry. If the jump value of this entry is zero, the second hash operation is applied to the same B-byte data block, the keyword table is looked up with the resulting hash value, and the jump value of the corresponding keyword table entry is updated in the same way as the jump table value above. Fig. 7 shows the jump table and keyword table after the jump values of the other data blocks in the optimal m-window of the keyword "cooperation" have been determined. After all keywords have been processed in this way, the complete jump table and keyword table have been generated and the keyword preprocessing stage ends.
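The update rule can be sketched as below. The hash functions are again assumed bit selections, dicts stand in for the tables, and the two 7-character optimal m-windows are invented so that a block with a positive jump value ("lion") collides in the jump table with a zero-valued entry ("cyon"). All names are hypothetical.

```python
def hash1(block):   # assumed 20-bit hash: low 4 bits of block[-3] + last two bytes
    return ((block[-3] & 0x0F) << 16) | (block[-2] << 8) | block[-1]


def hash2(block):   # assumed 17-bit hash: bit 4 of block[-3] + last two bytes
    return (((block[-3] >> 4) & 1) << 16) | (block[-2] << 8) | block[-1]


def fill_jump_values(windows, m, B):
    """Set jump values for every B-byte block of the given optimal m-windows."""
    default = m - B + 1
    jump, kw_jump = {}, {}
    # State left by window selection: the last block of each window is zeroed.
    for w in windows:
        last = w[m - B:]
        jump[hash1(last)] = 0
        kw_jump[hash2(last)] = 0
    for w in windows:
        for j in range(m - B + 1):
            block = w[j:j + B]
            value = m - B - j                    # jump value of this block
            h1 = hash1(block)
            jump[h1] = min(jump.get(h1, default), value)
            if jump[h1] == 0:                    # collision with a zero entry:
                h2 = hash2(block)                # keep the value in the keyword table
                kw_jump[h2] = min(kw_jump.get(h2, default), value)
    return jump, kw_jump


jump, kw_jump = fill_jump_values([b"halcyon", b"lionize"], m=7, B=4)
print(jump[hash1(b"halc")], jump[hash1(b"lcyo")])       # prints: 3 1
print(jump[hash1(b"lion")], kw_jump[hash2(b"lion")])    # prints: 0 3
```

The block "lion" has jump value 3 but shares a jump table entry with "cyon", whose value is 0. The keyword table entry reached through the second hash keeps the value 3, so a text block "lion" can still be skipped over during matching.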
In the matching stage, the invention first places a window of size B at the beginning of the text or network content and applies the third hash operation to the data block in the window; the third hash operation uses the same hash function as the first hash operation of the preprocessing stage. The jump table is looked up with the resulting hash value. If the jump value of the corresponding jump table entry is non-zero, the window is moved by this jump value and the matching procedure is repeated. If the jump value is zero, the fourth hash operation is applied to the data block; the fourth hash operation uses the same hash function as the second hash operation of the preprocessing stage. The keyword table is looked up with the resulting hash value. If the entry is associated with one or more keywords that may match, these keywords are matched in turn against the corresponding field of the text. The adjustment L is computed first: L = m - B + P, where P is the optimal m-window offset of the keyword currently being compared. If T is the distance between the current B window and the beginning of the text, the first character of the keyword is aligned with offset T - L in the text and the two are compared character by character. On a match, the match position (offset T - L) and the number of the matched keyword are output. After all associated keywords have been compared, the B window is moved by the jump value of this keyword table entry. This continues until the B window reaches the end of the text, at which point matching ends.
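Putting the two stages together, a minimal end-to-end sketch might look as follows. It is not the patented implementation: the hash functions are assumed bit selections, dicts replace the compressed tables, the keywords and text are invented, and the window is always advanced by at least one character after a keyword table lookup (an assumption added here so that a zero jump value cannot stall the scan).

```python
def hash1(b):   # assumed 20-bit hash: low 4 bits of b[-3] + last two bytes
    return ((b[-3] & 0x0F) << 16) | (b[-2] << 8) | b[-1]


def hash2(b):   # assumed 17-bit hash: bit 4 of b[-3] + last two bytes
    return (((b[-3] >> 4) & 1) << 16) | (b[-2] << 8) | b[-1]


def preprocess(keywords, B=4):
    m = min(len(k) for k in keywords)
    default = m - B + 1
    jump, kw_jump, kw_lists, offsets = {}, {}, {}, []
    for idx, kw in enumerate(keywords):          # choose the optimal m-windows
        P = 0
        for p in range(len(kw) - m + 1):
            blk = kw[p + m - B:p + m]
            if jump.get(hash1(blk), default) == 0 and not kw_lists.get(hash2(blk)):
                P = p
                break
        last = kw[P + m - B:P + m]
        jump[hash1(last)] = 0
        kw_jump[hash2(last)] = 0
        kw_lists.setdefault(hash2(last), []).append(idx)
        offsets.append(P)
    for kw, P in zip(keywords, offsets):         # fill the remaining jump values
        for j in range(m - B + 1):
            blk = kw[P + j:P + j + B]
            h1 = hash1(blk)
            jump[h1] = min(jump.get(h1, default), m - B - j)
            if jump[h1] == 0:
                h2 = hash2(blk)
                kw_jump[h2] = min(kw_jump.get(h2, default), m - B - j)
    return m, default, jump, kw_jump, kw_lists, offsets


def match(text, keywords, B=4):
    """Return (position, keyword index) pairs for all keyword occurrences."""
    m, default, jump, kw_jump, kw_lists, offsets = preprocess(keywords, B)
    hits, T = [], 0
    while T + B <= len(text):
        blk = text[T:T + B]
        step = jump.get(hash1(blk), default)
        if step == 0:
            h2 = hash2(blk)
            for idx in kw_lists.get(h2, []):
                start = T - (m - B + offsets[idx])        # T - L, L = m - B + P
                kw = keywords[idx]
                if start >= 0 and text[start:start + len(kw)] == kw:
                    hits.append((start, idx))
            step = max(1, kw_jump.get(h2, default))       # assumed: never stall
        T += step
    return hits


keywords = [b"halcyon", b"cooperation"]
print(match(b"halcyon days reward cooperation", keywords))  # prints: [(0, 0), (20, 1)]
```

On this sample text the scan visits only ten window positions for 31 bytes, and both verification steps are triggered by the shared zero-valued jump table entry and then resolved through different keyword table entries.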

Claims (2)

1, a kind of large-scale and multi-key word matching method that is used for text or network content analysis is characterized in that this method comprises following each step:
(1) determine in the keyword set length m of short keyword, m is the positive integer more than or equal to 4, sets up skip list and antistop list;
(2) window that is m with a size places the place that begins of above-mentioned first keyword of keyword set, the data block that last B the character that enters in the window formed is carried out first Hash operation, wherein B is 3 or 4, retrieve above-mentioned skip list with the cryptographic hash that obtains, if the jump of corresponding list item value is non-vanishing, then window is moved a character, repeating step (2), if zero, then enter step (3);
(3) data block that last B character in the above-mentioned window formed is carried out second Hash operation, retrieve above-mentioned antistop list with the cryptographic hash that obtains, if the keyword number of corresponding list item is non-vanishing, then window is moved a character, repeating step (2), if zero, then current window is optimum m window, and write down optimum m window and this keyword side-play amount P between beginning to locate, the value of P equals window and moves number of times, if window arrives first keyword low order end, preceding m the character of then specifying this keyword is optimum m window, and record shift quantity is zero;
(4) repeating step (2) and (3) obtain optimum m window and corresponding offset thereof in each keyword;
(5) be the data block of B according to all length in the optimum m window of all keywords in the above-mentioned keyword set and each window, revise the jump value in above-mentioned skip list and the antistop list, and keyword is associated with in the antistop list list item, obtain analyzing with skip list and analyze and use antistop list;
(6) window that is B with a size places the place that begins of text to be analyzed or Web content, data in the window are carried out the 3rd Hash operation, retrieve above-mentioned analysis skip list with the cryptographic hash that obtains, if the jump of corresponding list item value is non-vanishing, then according to jump value moving window, if the jump value is zero, then this data block is carried out the 4th Hash operation, retrieve above-mentioned analysis antistop list with the cryptographic hash that obtains, if be associated with one or more keywords that may mate in the corresponding list item, then successively keyword and text to be analyzed or the respective field in the Web content are carried out character relatively, and, arrive the text to be analyzed or the least significant end of Web content up to window according to the jump value moving window in this antistop list list item.
2, the method for claim 1 is characterized in that wherein saidly successively keyword and text to be analyzed or the respective field in the Web content being carried out character process relatively, may further comprise the steps:
(1) establishing the distance that first character of current B window and text to be analyzed or Web content begin to locate is T, obtain field to be compared in text to be analyzed or the Web content begin locate with text to be analyzed or Web content begin to locate apart from T-L, be the position that field to be compared begins to locate, L=m-B-P wherein, P are the side-play amount of the optimum m window of first keyword;
(2) first character of first keyword and field to be compared are begun the place align after one by one character compare, if coupling is then exported the keyword number of matched position T-L and this keyword, and is entered step (3), if do not match, then directly enter step (3);
(3) repeating step (1), (2), up to relevant keyword relatively finish.
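The comparison procedure of claim 2 can be illustrated with a short Python sketch. The function name `compare_candidates` and the layout of the candidate triples are hypothetical. One assumption deserves emphasis: the claim text gives L=m-B-P, whereas the sketch locates the start of the field at T - (m - B + P), which is what follows if P counts rightward moves of the window as described in step (3) of claim 1 and the current B window is the last block of the optimal m window. This is an assumption about the sign convention of P, not something the source confirms.

```python
def compare_candidates(text, t, candidates, m, b):
    """Compare each candidate keyword with the text field it would occupy.

    text       -- content being analysed
    t          -- index of the first character of the current B window (T)
    candidates -- (keyword_number, keyword, p) triples from one table entry,
                  where p is the offset P of the keyword's optimal m window
    Returns a list of (match_position, keyword_number) pairs.
    """
    results = []
    for number, keyword, p in candidates:
        # Assumed geometry: the B window is the last block of the optimal
        # m window, which itself starts p characters into the keyword.
        start = t - (m - b + p)
        if start < 0 or start + len(keyword) > len(text):
            continue  # the field would fall outside the text
        # Character-by-character comparison, as in step (2) of the claim.
        if all(text[start + i] == ch for i, ch in enumerate(keyword)):
            results.append((start, number))
    return results


# Keyword 0 is "abcdef" with optimal window "bcde" (m = 4, P = 1); its last
# block "de" sits at index 5 of the text, so the field starts at 5 - 3 = 2.
print(compare_candidates("xxabcdefyy", 5,
                         [(0, "abcdef", 1), (1, "zbcdez", 1)], 4, 2))
```

Every candidate in the entry is tried regardless of whether an earlier one matched, mirroring step (3) of the claim.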
CNB200710065392XA 2007-04-13 2007-04-13 Large-scale and multi-key word matching method for text or network content analysis Expired - Fee Related CN100452055C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB200710065392XA CN100452055C (en) 2007-04-13 2007-04-13 Large-scale and multi-key word matching method for text or network content analysis

Publications (2)

Publication Number Publication Date
CN101030221A (en) 2007-09-05
CN100452055C (en) 2009-01-14

Family

ID=38715567

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB200710065392XA Expired - Fee Related CN100452055C (en) 2007-04-13 2007-04-13 Large-scale and multi-key word matching method for text or network content analysis

Country Status (1)

Country Link
CN (1) CN100452055C (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1027183A (en) * 1997-04-11 1998-01-27 Hitachi Ltd Method and device for data registration
US20030084298A1 (en) * 2001-10-25 2003-05-01 Messerges Thomas S. Method for efficient hashing of digital content
CN100485664C (en) * 2002-04-10 2009-05-06 中兴通讯股份有限公司 Automatic state machine searching and matching method of multiple key words
JP4363868B2 (en) * 2002-08-23 2009-11-11 株式会社東芝 Search keyword analysis program, system and method

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398820B (en) * 2007-09-24 2010-11-17 北京启明星辰信息技术股份有限公司 Large scale key word matching method
CN101329680B (en) * 2008-07-17 2010-12-08 安徽科大讯飞信息科技股份有限公司 Large scale rapid matching method of sentence surface
CN101420440B (en) * 2008-12-05 2011-08-24 华为技术有限公司 String matching processing method and apparatus
CN101714166B (en) * 2009-10-30 2011-12-28 清华大学 Method and system for testing performance of large-scale multi-keyword precise matching algorithm
CN101754151B (en) * 2009-12-08 2013-07-10 武汉虹信通信技术有限责任公司 Bidirectional circulation searching and display device based on sliding window and method thereof
US9330085B2 (en) 2010-12-15 2016-05-03 International Business Machines Corporation Assisting users to generate desired meme in document
US10146755B2 (en) 2010-12-15 2018-12-04 International Business Machines Corporation Assisting users to generate desired meme in document
US9946697B2 (en) 2010-12-15 2018-04-17 International Business Machines Corporation Assisting users to generate desired meme in document
CN103186669B (en) * 2013-03-21 2018-07-06 厦门雅迅网络股份有限公司 Keyword fast filtering method
CN103186669A (en) * 2013-03-21 2013-07-03 厦门雅迅网络股份有限公司 Method for rapidly filtering key word
CN103793474B (en) * 2014-01-04 2017-01-11 北京理工大学 Knowledge management oriented user-defined knowledge classification method
CN105468597A (en) * 2014-08-14 2016-04-06 腾讯科技(北京)有限公司 Method and device for acquiring jump distance
CN105468597B (en) * 2014-08-14 2020-09-25 腾讯科技(北京)有限公司 Method and device for acquiring jump distance
CN105574038A (en) * 2014-10-16 2016-05-11 阿里巴巴集团控股有限公司 Text content recognition rate test method and device based on anti-recognition rendering
CN105574038B (en) * 2014-10-16 2019-03-01 阿里巴巴集团控股有限公司 Content of text discrimination test method and device based on anti-identification rendering
CN109643309A (en) * 2016-09-08 2019-04-16 华为技术有限公司 System and method for executing range query to jump list data structure
CN109643309B (en) * 2016-09-08 2021-02-12 华为技术有限公司 System and method for performing range queries on skip list data structures
US10990628B2 (en) 2016-09-08 2021-04-27 Huawei Technologies Co., Ltd. Systems and methods for performing a range query on a skiplist data structure
CN109145283A (en) * 2017-06-17 2019-01-04 黄冈 A kind of sensitive information detection method of artificial intelligence
CN109145283B (en) * 2017-06-17 2022-03-15 黄冈 Artificial intelligent sensitive information detection method
CN110543548A (en) * 2019-08-20 2019-12-06 国网江苏省电力有限公司电力科学研究院 multi-keyword rapid matching method and matching system for communication content of Internet of things equipment
CN110543548B (en) * 2019-08-20 2022-08-19 国网江苏省电力有限公司电力科学研究院 Multi-keyword rapid matching method and matching system for communication content of Internet of things equipment
CN111476037A (en) * 2020-04-14 2020-07-31 腾讯科技(深圳)有限公司 Text processing method and device, computer equipment and storage medium
CN111476037B (en) * 2020-04-14 2023-03-31 腾讯科技(深圳)有限公司 Text processing method and device, computer equipment and storage medium
CN112836008A (en) * 2021-02-07 2021-05-25 中国科学院新疆理化技术研究所 Index establishing method based on decentralized storage data
CN112836008B (en) * 2021-02-07 2023-03-21 中国科学院新疆理化技术研究所 Index establishing method based on decentralized storage data

Also Published As

Publication number Publication date
CN100452055C (en) 2009-01-14

Similar Documents

Publication Publication Date Title
CN100452055C (en) Large-scale and multi-key word matching method for text or network content analysis
CN101398820B (en) Large scale key word matching method
CN101359325B (en) Multi-key-word matching method for rapidly analyzing content
US20140019486A1 (en) Logic Content Processing for Hardware Acceleration of Multi-Pattern Search
CN107918604B (en) Chinese word segmentation method and device
EP2287756B1 (en) Systems and methods for efficient keyword spotting in communication traffic
CN87100829A (en) The method and apparatus that is used to retrieve
CN101149739A (en) Internet faced sensing string digging method and system
CN103412858A (en) Method for large-scale feature matching of text content or network content analyses
CN103186669B (en) Keyword fast filtering method
CN102750379B (en) Fast character string matching method based on filtering type
CN102184197A (en) Regular expression matching method based on smart finite automaton (SFA)
CN101369278B (en) Approximate adaptation method and apparatus
CN101364237A (en) Multi-keyword matching method and device
CN106708956B (en) A kind of HTTP data matching method based on more URL rule sets
CN110569245A (en) Fingerprint index prefetching method based on reinforcement learning in data de-duplication system
CN108509505A (en) A kind of character string retrieving method and device based on subregion even numbers group Trie
CN107515931A (en) A kind of duplicate data detection method based on cluster
CN112527948A (en) Data real-time duplicate removal method and system based on sentence-level index
CN101075260A (en) Method and module for extracting summary
CN102955812A (en) Method and device for building index database as well as method and device for querying
CN1776688A (en) Word data searching method
Yuan et al. An improved pattern matching algorithm
WO2011073680A1 (en) Improvements relating to hash tables
CN110019637B (en) Sorting algorithm for standard document retrieval

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 2009-01-14

Termination date: 2011-04-13