CN101714166A - Method and system for testing performance of large-scale multi-keyword precise matching algorithm

Info

Publication number: CN101714166A
Application number: CN200910236817A
Authority: CN
Inventors: 薛一波; 李雪
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2009-10-30
Filing date: 2009-10-30
Publication date: 2010-05-26
Anticipated expiration: 2029-10-30
Also published as: CN101714166B

Abstract

The invention provides a system for testing performance of a large-scale multi-keyword precise matching algorithm. The system comprises a test data generating module and a keyword set preprocessing performance test module, wherein the test data generating module specifically comprises a random keyword generating sub-module, a random text data generating sub-module and a sub-module for generating a text to be matched; and the keyword set preprocessing performance test module specifically comprises a matching algorithm preprocessing interface calling sub-module and a test information generating sub-module. The method and the system solve the problems of interface standards and interoperation access between different network information security devices, realize cooperative work and linkage between the network information security devices and finally realize seamless integration of the network information security devices, and can test the performance indexes of various multi-keyword precise matching algorithms.

Description

Method and system for testing the performance of large-scale multi-keyword precise matching algorithms

Technical field

The present invention relates to the field of performance testing for computer data processing, and in particular to a method for testing the performance of large-scale multi-keyword precise matching algorithms.

Background technology

Multi-keyword matching, also called multi-pattern matching, is one of the fundamental problems in computer science. The problem is to determine, quickly and accurately, the positions at which any of a set of patterns occurs in a text under test or in network content. Multi-pattern matching is widely used: besides network security applications such as firewalls, intrusion detection and prevention, virus detection, and web content filtering, it has spread to other disciplines and fields, for example information management, web search engines, and gene sequence detection in bioinformatics. Research on multi-keyword matching and related techniques therefore has considerable scientific and practical value and has drawn attention from both academia and industry.

Many classic multi-keyword matching algorithms already exist: the skip-based Wu-Manber algorithm, the Aho-Corasick and AC-BM algorithms based on finite-state automata, the factor-based SBOM algorithm, and so on. In recent years, as the number of keywords and the required processing speed have kept growing, many improved multi-keyword matching algorithms have also been proposed. With so many algorithms available, what are the criteria for evaluating their performance? A multi-keyword matching algorithm generally has two phases: a preprocessing phase and a search phase. The preprocessing phase processes the keyword set, and the work it does differs from algorithm to algorithm: Wu-Manber mainly builds three tables (a shift table, a hash table, and a prefix table), whereas Aho-Corasick builds a finite-state automaton. Preprocessing needs to run only once, and its result does not change once the keyword set is fixed. The main performance criteria for the preprocessing phase are therefore preprocessing time and storage space usage. The search phase matches the input text or real-time data, and its main criterion is the matching speed, that is, the speed at which the input is processed. In general, then, the performance criteria for a multi-keyword matching algorithm are matching speed, preprocessing time, and storage space usage.

Among existing multi-keyword precise matching algorithms, some match quickly but their storage consumption grows exponentially as the number of keywords increases, as with the Aho-Corasick algorithm. Others consume an acceptable amount of storage but have long preprocessing times, which become unacceptable as the keyword set keeps growing, as with the SBOM algorithm. Still others do well on preprocessing time, space usage, and matching speed but have a worst case in which matching speed drops sharply, as with the Wu-Manber algorithm. These are only rough qualitative assessments; no quantitative performance evaluation and comparison exists. Different applications place different requirements on the time and space characteristics of a matching algorithm, and most applications weigh both when choosing the most suitable one. So when selecting a matching algorithm for an application, or when examining a newly improved algorithm, how should the performance of the algorithms be compared? How can one judge that an algorithm is better than the others? So far there is no unified test and evaluation method.

Summary of the invention

(1) Technical problem to be solved

The object of the present invention is to overcome the deficiencies of the prior art by providing a unified method and system for testing the performance of large-scale multi-keyword precise matching algorithms, capable of testing the performance indexes of various multi-keyword precise matching algorithms.

(2) Technical solution

To address the above problem, the present invention proposes a system for testing the performance of large-scale multi-keyword precise matching algorithms. The system comprises the following modules:

F1: a test data generation module, comprising:

F11: a random keyword generation submodule, for randomly generating a keyword set;

F12: a random text data generation submodule, for generating random text data;

F13: a to-be-matched text generation submodule, for inserting keywords from the keyword set into the text data to produce the text to be matched;

F2: a keyword set preprocessing performance test module, comprising:

F21: a preprocessing interface calling submodule, for calling the preprocessing interface of the matching algorithm through the general matching algorithm calling interface;

F22: a test information generation submodule, for taking the keyword set as the input file, running the preprocessing to generate the keyword-related data structure, and collecting the key information of the run, the key information comprising the preprocessing time and the maximum memory occupied by the data structure generated from the keywords;

Wherein the system further comprises the following module:

F3: a search performance test module for the matching algorithm, comprising:

F31: a search program interface calling submodule, for calling the search program interface of the matching algorithm under test through the general matching algorithm calling interface;

F32: a search program scanning submodule, for taking the data structure produced by module F2 as input, running the search program of the matching algorithm under test, and scanning the text file to be matched;

F33: a test information generation submodule, for recording the numbers of the keywords that occur in the text to be matched and the positions at which they occur, saving this information to the output result file, and at the same time recording the search time and the maximum memory used during the search;

Wherein the system further comprises the following module:

F4: a search result verification and statistical report generation module, comprising:

a statistics generation submodule, for, after modules F2 and F3 have finished, taking the expected result data and the actual test result data as input and comparing them to verify the correctness of the algorithm, and then taking the performance information produced by modules F2 and F3 as input, compiling statistics, and outputting the test results.
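
The general matching algorithm calling interface used by F21 and F31 can be pictured as a two-method contract. The following is a minimal sketch only: the class and method names (`MatchingAlgorithm`, `preprocess`, `search`) and the `NaiveMatcher` baseline are hypothetical illustrations and are not taken from the patent.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class MatchingAlgorithm(ABC):
    """Hypothetical uniform calling interface; all names are illustrative."""

    @abstractmethod
    def preprocess(self, keywords: Dict[int, bytes]) -> None:
        """Build the algorithm's internal data structure from the keyword set."""

    @abstractmethod
    def search(self, text: bytes) -> List[Tuple[int, int]]:
        """Return (keyword number, start offset) for every occurrence in text."""


class NaiveMatcher(MatchingAlgorithm):
    """Baseline stand-in: scans the text once per keyword with bytes.find."""

    def preprocess(self, keywords: Dict[int, bytes]) -> None:
        # The naive matcher needs no real preprocessing; it only keeps a copy.
        self._keywords = dict(keywords)

    def search(self, text: bytes) -> List[Tuple[int, int]]:
        hits = []
        for number, word in self._keywords.items():
            pos = text.find(word)
            while pos != -1:
                hits.append((number, pos))
                pos = text.find(word, pos + 1)  # allow overlapping occurrences
        # Order by position, then keyword number, for stable comparison.
        return sorted(hits, key=lambda hit: (hit[1], hit[0]))
```

Under this assumption, any algorithm that implements the two methods can be swapped into the test modules without changing the test code.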

(3) beneficial effect

With the performance test system of the present invention, keyword sets and texts to be matched with different characteristics can be generated and used to test different large-scale multi-keyword matching algorithms. Because the invention establishes a unified platform architecture, every multi-keyword precise matching algorithm can be tested on the same platform, so the performance, design, and efficiency of the various algorithms can be evaluated quantitatively in a fair and reasonable way.

Description of drawings

Fig. 1 is a framework diagram of the evaluation and test platform of the present invention;

Fig. 2 is a structural diagram of the test data generation module of the present invention;

Fig. 3 is a model diagram of the matching algorithm test module of the present invention.

Embodiment

The method and system for testing the performance of large-scale multi-keyword precise matching algorithms proposed by the present invention are described below with reference to the drawings and embodiments. The following embodiments only illustrate the invention and do not limit it. A person of ordinary skill in the relevant art can make various changes and modifications without departing from the spirit and scope of the invention, so all equivalent technical solutions also fall within its scope, and the scope of patent protection is defined by the claims.

Fig. 1 shows the framework of the general evaluation and test platform for multi-keyword precise matching algorithms. The platform consists of two parts: a test data generation part and a matching algorithm performance test part. The test data generation part consists of three submodules and generates the keyword set and the text data to be matched. The matching algorithm performance test part comprises three submodules and one general matching algorithm calling interface.

The present invention is a unified evaluation and test platform for multi-keyword matching algorithms used in text data or network content analysis. It is carried out in two steps. The first step is the test data generation phase, which generates the keyword set and the text data to be matched. The second step is the matching algorithm test phase, which runs the matching algorithm through the general interface, measures the performance indexes of its preprocessing phase and search phase, and derives the actual performance of each algorithm from the measured indexes. Each phase is described in detail below.

The first phase is test data generation. As shown in Fig. 2, the module consists of three submodules: a keyword set generator, a random text generator, and a test text synthesizer. The keyword set generator produces a keyword set with the characteristics specified in the input configuration file. The random text generator produces random text, which serves as the data source for the final text to be matched. The test text synthesizer produces text data to be matched with specific characteristics from the input configuration file, keyword set, and random text.

The detailed steps of this phase are as follows:

1. Read the configuration file configure. The parameters that can be set in this file and their meanings are listed in Table 1.

Table 1: Configurable parameters for generating the data sources

Configuration item	Type	Description
randseed	Integer	Random seed: the integer used to seed random number generation; default 100
sigmasize	Integer	Character set size
beginASC	Integer	ASCII code of the first character (sigmasize + beginASC must be less than or equal to 256)
Function	Integer	Function=0: generate the keyword set and the text to be matched together (equivalent to functions 1 and 2 combined); Function=1: generate a random keyword set, with the file name given by patternfile; Function=2: read keywords from patternfile and generate the text to be matched, with the file name given by textfile; Function=3: convert an existing keyword set file to the platform's unified binary format
textsizeM	Integer	Text size (in MB)
patternnum	Integer	Number of keywords
patterntype	Integer	patterntype=1: keyword lengths vary, and the following five parameters take effect; patterntype=0: all keywords have the same length, given by Lminlen
patternratio	Integer	Proportion of high-frequency keywords (%)
Hminlen	Integer	Minimum length of high-frequency keywords
Hmaxlen	Integer	Maximum length of high-frequency keywords
Lminlen	Integer	Minimum length of the other keywords
Lmaxlen	Integer	Maximum length of the other keywords
matchtimes	Integer	Number of times each keyword is matched in the text
matchfre	Integer	Percentage of keywords that are matched. Together with matchtimes it expresses the overall match rate: for a 20% rate set matchfre to 20; for 80% set matchfre to 80; for 300% set matchfre to 100 and matchtimes to 3
textfile	Text	Path of the output text when Function=0 or 2; unused when Function=1
patternfile	Text	Path of the output keyword set when Function=0, 1, or 3; path of the keyword set to read when Function=2
verifyfile	Text	Output file of the data to be verified when Function=0 or 2; unused when Function=1
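
The parameters in Table 1 could be loaded and checked along the following lines. This is a sketch under stated assumptions: the patent does not specify the syntax of the configure file, so the `key = value` line format, the `#` comment marker, and the function names are hypothetical. Only the defaults named in the text (randseed 100, pattern.cfg, text.dat, toverify.dat) and the constraint sigmasize + beginASC <= 256 come from the source.

```python
from typing import Dict, Union

# Parameter names from Table 1 that hold integers.
INT_KEYS = {
    "randseed", "sigmasize", "beginASC", "Function", "textsizeM",
    "patternnum", "patterntype", "patternratio", "Hminlen", "Hmaxlen",
    "Lminlen", "Lmaxlen", "matchtimes", "matchfre",
}

# Defaults stated in the description.
DEFAULTS = {
    "randseed": 100,
    "patternfile": "pattern.cfg",
    "textfile": "text.dat",
    "verifyfile": "toverify.dat",
}

Config = Dict[str, Union[int, str]]


def validate(config: Config) -> None:
    """Check the constraints that the description states explicitly."""
    # Falling back to 256 and 0 when the items are absent is an assumption.
    sigmasize = config.get("sigmasize", 256)
    begin = config.get("beginASC", 0)
    if not 1 <= sigmasize <= 256:
        raise ValueError("sigmasize should be between 1 and 256")
    if sigmasize + begin > 256:
        raise ValueError("sigmasize + beginASC must not exceed 256")
    if config.get("Function", 0) not in (0, 1, 2, 3):
        raise ValueError("Function must be 0, 1, 2, or 3")


def parse_config(source: str) -> Config:
    """Parse 'key = value' lines (assumed format) into a validated dict."""
    config: Config = dict(DEFAULTS)
    for raw in source.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        config[key] = int(value) if key in INT_KEYS else value
    validate(config)
    return config
```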

2. Parse the configuration file. Depending on the value of the Function item, different data sources are produced: the keyword set, the text to be matched, or both.

2.1 If Function is 1, a keyword set is generated. Its characteristics are given by the corresponding items in the configuration file. The generated keyword set is saved as a binary file whose name is given by the patternfile parameter (default pattern.cfg).

(1) The parameters in configure that concern keyword set generation are: the character set size sigmasize, the number of keywords patternnum, whether keyword lengths vary patterntype, the minimum length Hminlen and maximum length Hmaxlen of high-frequency keywords, the proportion of high-frequency keywords patternratio, the minimum length Lminlen and maximum length Lmaxlen of the other keywords, and so on. The keyword set is generated from these parameters as follows:

(1.1) Check the patterntype parameter, which indicates whether the keywords in the set vary in length. If patterntype is 0, all generated keywords have the same length: read the value of Lminlen as the keyword length and go to step (1.3). If patterntype is 1, the keywords must vary in length: go to step (1.2) and continue reading parameters;

(1.2) Read Hminlen and Hmaxlen, the minimum and maximum lengths of high-frequency keywords. Read patternratio, the percentage of all keywords that are high-frequency keywords. Read Lminlen and Lmaxlen, the minimum and maximum lengths of the keywords other than the high-frequency ones. Then go to the next step;

(1.3) Read sigmasize to obtain the size of the character set from which keywords are generated (it should be between 1 and 256), and read patternnum to obtain the number of keywords to generate;

(1.4) Pass the parameters read above, together with the keyword set file name given by patternfile, to the keyword set generator module, which produces a random keyword set file. Each keyword record has the form "keyword number + tab + keyword + newline".

An example configuration for generating a keyword set is shown in Table 2:

Table 2: Example parameters for generating a keyword set

Configuration item	Value	Description
randseed	100	Random seed set to the default value
sigmasize	256	Character set size set to 256
beginASC	0	ASCII code of the first character (sigmasize + beginASC must be less than or equal to 256)
Function	1	Function=1: generate a random keyword set
patternnum	50000	50000 keywords
patterntype	1	Keyword lengths vary
patternratio	80	Proportion of high-frequency keywords (80%): keywords of high-frequency length make up 80% of all keywords
Hminlen	8	Minimum length of high-frequency keywords: 8 bytes
Hmaxlen	16	Maximum length of high-frequency keywords: 16 bytes
Lminlen	4	Minimum length of the other keywords: 4 bytes
Lmaxlen	100	Maximum length of the other keywords: 100 bytes
patternfile	Pattern.cfg	The output keyword set file is named pattern.cfg
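
Steps (1.1) to (1.4) can be sketched as follows, using the Table 2 values as the example configuration. Several details here are assumptions rather than the patent's method: the function names are hypothetical, the high-frequency share is applied as an exact count over the first keywords, lengths are drawn uniformly, and duplicate keywords are not filtered out.

```python
import random
from typing import Dict

# Example configuration taken from Table 2.
TABLE_2 = {
    "randseed": 100, "sigmasize": 256, "beginASC": 0, "patternnum": 50000,
    "patterntype": 1, "patternratio": 80, "Hminlen": 8, "Hmaxlen": 16,
    "Lminlen": 4, "Lmaxlen": 100,
}


def generate_keywords(cfg: Dict[str, int]) -> Dict[int, bytes]:
    """Return {keyword number: keyword bytes} for the given configuration."""
    rng = random.Random(cfg["randseed"])
    alphabet = range(cfg["beginASC"], cfg["beginASC"] + cfg["sigmasize"])
    # Assumption: exactly patternratio percent of keywords are high-frequency.
    high_count = cfg["patternnum"] * cfg["patternratio"] // 100
    keywords = {}
    for number in range(1, cfg["patternnum"] + 1):
        if cfg["patterntype"] == 0:
            length = cfg["Lminlen"]  # fixed length, step (1.1)
        elif number <= high_count:
            length = rng.randint(cfg["Hminlen"], cfg["Hmaxlen"])
        else:
            length = rng.randint(cfg["Lminlen"], cfg["Lmaxlen"])
        keywords[number] = bytes(rng.choice(alphabet) for _ in range(length))
    return keywords


def serialize(keywords: Dict[int, bytes]) -> bytes:
    """Encode records as 'keyword number + tab + keyword + newline'."""
    return b"".join(
        str(number).encode("ascii") + b"\t" + word + b"\n"
        for number, word in keywords.items()
    )
```

Calling `serialize(generate_keywords(TABLE_2))` yields the bytes of a keyword set file in the record format of step (1.4). With a 256-symbol character set a keyword can itself contain tab or newline bytes, which makes this record layout ambiguous; the text does not say how that case is handled.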

(2) If a keyword set file already exists, such as a virus signature file or a spam signature file, the Function item in the configuration file can be set to 3 to convert the file to the platform's unified binary format; the output file name is specified by the patternfile parameter.

2.2 If Function is 2, the text to be matched is generated. It is produced from the keyword set file specified by patternfile and from the other configuration items. The result is a binary file whose name is specified by the textfile parameter (default text.dat).

(1) The parameters in configure that concern generation of the text to be matched are: the character set size sigmasize, the size of the generated text textsizeM, the number of times a keyword is matched in the text matchtimes, the percentage of all keywords that are matched matchfre, the number of keywords read in patternnum, the keyword set file name patternfile, the name of the output verification file verifyfile, and so on. The text is generated from these parameters as follows:

(1.1) Read patternfile to obtain the keyword set file name, and open that file;

(1.2) Read matchfre and compute the number of keywords to extract from this percentage and the total number of keywords. If matchfre < 100, the number is computed directly from matchfre and patternnum. For example, with patternnum=5000 and matchfre=20 (matched keywords make up 20% of all keywords), 1000 keywords are drawn at random from the keyword set for insertion into the text in a later step. If matchfre=100, all keywords are inserted into the text, and the matchtimes parameter must also be read: every keyword is inserted into the text matchtimes times. For example, with matchtimes=2 every keyword is inserted at random positions in the text twice;

(1.3) Read sigmasize to obtain the character set size (it should be between 1 and 256), and read textsizeM to obtain the size of the text to generate;

(1.4) Have the random text generator module produce random text as the data source for the final text to be matched. The size of the random text is the configured text size textsizeM minus the total size of the keywords to be inserted;

(1.5) Have the test text synthesizer module draw keywords at random according to the computed extraction count, and insert each one at a randomly chosen position in the text produced in step (1.4). After each insertion, record the number of the inserted keyword and its insertion position. All keyword numbers and insertion positions are recorded in the file named by the verifyfile parameter (default toverify.dat), so that they can be compared with the matching results output by the matching algorithm to verify its correctness.
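
Steps (1.2) to (1.5) can be sketched as follows. This is an illustration under assumptions, not the patent's implementation: the function names are hypothetical, the text size is given in bytes rather than MB, and applying matchtimes to a partial extraction (matchfre < 100) is an assumption, since the text only describes reading matchtimes when matchfre is 100.

```python
import random
from typing import Dict, List, Tuple


def extraction_plan(keywords: Dict[int, bytes], matchfre: int,
                    matchtimes: int, rng: random.Random) -> List[int]:
    """Return keyword numbers to insert, one list entry per insertion."""
    numbers = sorted(keywords)
    if matchfre < 100:
        count = len(numbers) * matchfre // 100  # e.g. 5000 * 20 // 100 = 1000
        return rng.sample(numbers, count) * matchtimes
    return numbers * matchtimes


def synthesize(keywords: Dict[int, bytes], text_size: int, matchfre: int,
               matchtimes: int, sigmasize: int = 256, begin_asc: int = 0,
               seed: int = 100) -> Tuple[bytes, List[Tuple[int, int]]]:
    """Build the text to be matched plus (keyword number, offset) records."""
    rng = random.Random(seed)
    plan = extraction_plan(keywords, matchfre, matchtimes, rng)
    rng.shuffle(plan)
    # Step (1.4): filler size is the target size minus all inserted keywords.
    filler_size = text_size - sum(len(keywords[n]) for n in plan)
    if filler_size < 0:
        raise ValueError("keywords do not fit in the requested text size")
    alphabet = range(begin_asc, begin_asc + sigmasize)
    filler = bytes(rng.choice(alphabet) for _ in range(filler_size))
    # Step (1.5): choose random cut points in the filler and splice keywords in.
    cuts = sorted(rng.randint(0, filler_size) for _ in plan)
    out = bytearray()
    records = []
    previous = 0
    for cut, number in zip(cuts, plan):
        out += filler[previous:cut]
        records.append((number, len(out)))  # offset in the final text
        out += keywords[number]
        previous = cut
    out += filler[previous:]
    return bytes(out), records
```

The returned records correspond to the content of the verification file. The random filler may contain keywords by chance, and adjacent insertions may create additional occurrences, so a matcher can legitimately report more hits than were recorded.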

An example configuration for generating the text to be matched is shown in Table 3:

Table 3: Example parameters for generating the text to be matched

Configuration item	Value	Description
randseed	100	Random seed set to the default value
sigmasize	256	Character set size
beginASC	0	ASCII code of the first character (sigmasize + beginASC must be less than or equal to 256)
Function	0	Function=0: generate the keyword set and the text to be matched together
patternnum	50000	50000 keywords
textsizeM	64	Text size (64 MB)
matchtimes	1	Each randomly drawn keyword is matched once
matchfre	20	Matched keywords make up 20% of all keywords
textfile	Text.dat	The output text file is named text.dat
patternfile	Pattern.cfg	The keyword set file read in is named pattern.cfg
verifyfile	Toverify.dat	The output verification file is named toverify.dat

2.3 If Function is 0, the keyword set and the text to be matched are generated together. Their file names are given by the patternfile and textfile parameters respectively, and their characteristics are given by the other items in the configuration file.

This step is carried out by executing steps 2.1 and 2.2 in turn, and it finally outputs the keyword set file, the text to be matched, and the correctness verification file.
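
The way the Function item selects the work to be done can be summarized in a small dispatch table. This is a sketch only; the action names are hypothetical labels for steps 2.1, 2.2, and the format conversion, not identifiers from the patent.

```python
from typing import List


def plan_actions(function: int) -> List[str]:
    """Map the Function item to the ordered list of generation steps."""
    actions = {
        0: ["generate_keywords", "generate_text"],  # steps 2.1 then 2.2
        1: ["generate_keywords"],                   # step 2.1
        2: ["generate_text"],                       # step 2.2
        3: ["convert_keywords"],                    # existing file to platform format
    }
    if function not in actions:
        raise ValueError("Function must be 0, 1, 2, or 3")
    return actions[function]
```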

The second phase is the matching algorithm performance test. As shown in Fig. 3, the performance test module comprises three submodules and one general matching algorithm calling interface: the keyword set preprocessing performance test submodule, the search phase performance test submodule, the test result verification and data statistics submodule, and the general interface through which matching algorithms can be swapped. Testing has two main stages: evaluation of the preprocessing phase and evaluation of the search phase. The detailed steps are as follows:

1. Obtain the multi-keyword precise matching algorithm to be evaluated, the keyword set file, and the text to be matched.

2. Evaluate the preprocessing phase of the matching algorithm

(2.1) The keyword set preprocessing performance test submodule calls the preprocessing interface of the matching algorithm through the general matching algorithm calling interface;

(2.2) With the keyword set as the input file, run the preprocessing phase of the multi-pattern matching algorithm under evaluation;

(2.3) When the preprocessing phase has finished, the keyword-related data structure has been generated. Collect the key information of the run, which comprises the preprocessing time and the maximum memory occupied by the data structure generated from the keywords.
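
The measurements in step (2.3) can be sketched with the Python standard library. This is an assumption-laden illustration: `measure_preprocessing` and the `build_trie` stand-in are hypothetical names, and `tracemalloc` only sees memory allocated through Python's allocator, so a test harness for native code would need a different memory probe.

```python
import time
import tracemalloc
from typing import Any, Callable, Dict, Tuple


def build_trie(keywords: Dict[int, bytes]) -> dict:
    """Stand-in preprocessing step: a byte trie with keyword numbers at leaves."""
    root: dict = {}
    for number, word in keywords.items():
        node = root
        for byte in word:
            node = node.setdefault(byte, {})
        node[None] = number  # None marks the end of a keyword
    return root


def measure_preprocessing(
    preprocess: Callable[[Dict[int, bytes]], Any],
    keywords: Dict[int, bytes],
) -> Tuple[Any, Dict[str, float]]:
    """Run preprocessing once and report elapsed time and peak traced memory."""
    tracemalloc.start()
    started = time.perf_counter()
    structure = preprocess(keywords)
    elapsed = time.perf_counter() - started
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return structure, {"preprocess_seconds": elapsed, "peak_bytes": peak}
```

Tracing allocations slows execution, so a more careful harness would probably time and trace memory in separate runs.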

3. Evaluate the search phase of the matching algorithm

(3.1) The search phase performance test submodule calls the search program interface of the matching algorithm under test through the general matching algorithm calling interface;

(3.2) With the data structure generated by the preprocessing phase as input, run the search program of the matching algorithm under test and scan the text file to be matched;

(3.3) Record the numbers of the keywords that occur in the text to be matched and the positions at which they occur, and save this information to the output result file. At the same time, record the search time and information such as the maximum memory used during the search.
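
Step (3.3) can be sketched in the same style. The names below are hypothetical, the naive search stands in for the algorithm under test, and the one-record-per-line output layout is an assumption, since the format of the output result file is not specified.

```python
import time
import tracemalloc
from typing import Any, Callable, Dict, List, Tuple

Hit = Tuple[int, int]  # (keyword number, start offset)


def naive_search(keywords: Dict[int, bytes], text: bytes) -> List[Hit]:
    """Stand-in search routine: find every occurrence of every keyword."""
    hits = []
    for number, word in keywords.items():
        pos = text.find(word)
        while pos != -1:
            hits.append((number, pos))
            pos = text.find(word, pos + 1)
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


def measure_search(search: Callable[[Any, bytes], List[Hit]],
                   structure: Any, text: bytes) -> Tuple[List[Hit], Dict[str, float]]:
    """Run the search once; report hits, elapsed time, and peak traced memory."""
    tracemalloc.start()
    started = time.perf_counter()
    hits = search(structure, text)
    elapsed = time.perf_counter() - started
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    stats = {"search_seconds": elapsed, "peak_bytes": peak, "matches": len(hits)}
    return hits, stats


def format_results(hits: List[Hit]) -> str:
    """Render hits as 'keyword number<TAB>position' lines (assumed layout)."""
    return "".join(f"{number}\t{position}\n" for number, position in hits)
```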

4. Verify the search results and generate the statistical report

After the preprocessing phase and the search phase have finished, the test result verification and statistics module takes the expected result data (the file specified by the verifyfile parameter) and the actual test result data as input and compares them to verify the correctness of the algorithm. It then takes the performance information produced by the preprocessing module and the search module as input, compiles statistics, and outputs the evaluation report. The content of the report is shown in Table 4.
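
The comparison of expected and actual results can be sketched as a set comparison. The function name and the report keys are hypothetical. Treating extra hits as informational rather than as errors is an assumption: random text can contain a keyword by chance, so only missing expected hits are counted against the algorithm here.

```python
from typing import Dict, Iterable, List, Tuple, Union

Hit = Tuple[int, int]  # (keyword number, start offset)


def verify(expected: Iterable[Hit],
           actual: Iterable[Hit]) -> Dict[str, Union[bool, List[Hit]]]:
    """Compare verification records against the matcher's reported hits."""
    expected_set, actual_set = set(expected), set(actual)
    missing = sorted(expected_set - actual_set)      # inserted but not found
    unexpected = sorted(actual_set - expected_set)   # found but not recorded
    return {"correct": not missing, "missing": missing, "unexpected": unexpected}
```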

Table 4: Example of the performance index report that the evaluation platform can produce

5. If several multi-keyword precise matching algorithms have each been evaluated and a report like Table 4 has been produced for each, the reports can be fed into the test result verification and statistics module to produce a side-by-side comparison of each performance index (preprocessing time, storage space usage, and matching speed).
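
A side-by-side comparison of this kind could be assembled as follows. This is a sketch under assumptions: the report field names, the tab-separated layout, the choice of MB/s as the unit of matching speed, and the ordering by speed are all hypothetical, since the content of the per-algorithm report is not reproduced here.

```python
from typing import Dict, List, Union

Report = Dict[str, Union[str, float]]


def matching_speed_mb_s(text_bytes: int, search_seconds: float) -> float:
    """Matching speed as megabytes of text scanned per second."""
    return text_bytes / (1024 * 1024) / search_seconds


def comparison_report(reports: List[Report]) -> str:
    """Render one tab-separated row per algorithm, fastest matcher first."""
    rows = ["algorithm\tpreprocess_s\tmemory_MB\tspeed_MB_s"]
    for r in sorted(reports, key=lambda r: r["speed_mb_s"], reverse=True):
        rows.append(
            f'{r["algorithm"]}\t{r["preprocess_s"]:.3f}\t'
            f'{r["memory_mb"]:.1f}\t{r["speed_mb_s"]:.1f}'
        )
    return "\n".join(rows)
```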

Claims

1. A method for testing the performance of a large-scale multi-keyword precise matching algorithm, characterized in that the method comprises the following steps:

S1: a test data generation step, comprising:

S11: randomly generating a keyword set;

S12: generating random text data;

S13: inserting keywords from the keyword set into the text data to produce the text to be matched;

S2: a keyword set preprocessing performance test step, comprising:

S21: calling the preprocessing interface of the matching algorithm through the general matching algorithm calling interface;

S22: taking the keyword set as the input file, running the preprocessing to generate the keyword-related data structure, and collecting the key information of the run, the key information comprising the preprocessing time and the maximum memory occupied by the data structure generated from the keywords.

2. The method for testing the performance of a large-scale multi-keyword precise matching algorithm according to claim 1, characterized in that the method further comprises the following steps:

S3: a search performance test step for the matching algorithm, comprising:

S31: calling the search program interface of the matching algorithm under test through the general matching algorithm calling interface;

S32: taking the data structure generated by step S2 as input, running the search program of the matching algorithm under test, and scanning the text file to be matched;

S33: recording the numbers of the keywords that occur in the text to be matched and the positions at which they occur, saving this information to the output result file, and at the same time recording the search time and the maximum memory used during the search.

3. The method for testing the performance of a large-scale multi-keyword precise matching algorithm according to claim 1, characterized in that the method further comprises the following steps:

S4: a search result verification and statistical report generation step, comprising:

after steps S2 and S3 have been completed, taking the expected result data and the actual test result data as input and comparing them to verify the correctness of the algorithm, and then taking the performance information produced by steps S2 and S3 as input, compiling statistics, and outputting the test results.

4. A system for testing the performance of a large-scale multi-keyword precise matching algorithm, characterized in that the system comprises the following modules:

F1: a test data generation module, comprising:

F2: a keyword set preprocessing performance test module, comprising:

5. The system for testing the performance of a large-scale multi-keyword precise matching algorithm according to claim 4, characterized in that the system further comprises the following module:

6. The system for testing the performance of a large-scale multi-keyword precise matching algorithm according to claim 5, characterized in that the system further comprises the following module:

a statistics generation submodule, for, after modules F2 and F3 have finished, taking the expected result data and the actual test result data as input and comparing them to verify the correctness of the algorithm, and then taking the performance information produced by modules F2 and F3 as input, compiling statistics, and outputting the test results.