WO2016082503A1 - Method and apparatus for automaton-based pattern matching - Google Patents
Method and apparatus for automaton-based pattern matching
- Publication number
- WO2016082503A1 (application PCT/CN2015/080174)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- matching
- mode
- input content
- current input
- shift address
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
Definitions
- The present invention relates to the field of computer technologies, and in particular to a method and apparatus for automaton-based pattern matching.
- The Wu-Manber algorithm (WM algorithm for short) was proposed by Sun Wu and Udi Manber in the 1990s and evolved from the single-pattern matching algorithm BM (Boyer-Moore).
- The WM algorithm uses techniques such as block characters (SHIFT), HASH, and a prefix table (PREFIX) to achieve good matching efficiency.
- Figure 1 shows the data model of the Wu-Manber algorithm.
- The data model consists of three parts: the SHIFT address table, the HASH-PATTERNS table (prefix list), and the PREFIX table. The SHIFT address table records offset addresses, and the HASH-PATTERNS table records the correspondence between HASH values and pattern sub-groups.
- The embodiments of the present invention provide a method and apparatus for automaton-based pattern matching, so that the optimized algorithm has stable matching performance while efficiency and correctness are preserved.
- A method for automaton-based pattern matching, comprising: searching a SHIFT address table according to the current input content to obtain a SHIFT address value; and determining whether the obtained SHIFT address value is zero. If the SHIFT address value is zero, a HASH value is calculated from the prefix of the current input content, the HASH value is used as an index to enter the actually matched pattern sub-group, and a matching search is performed on the pattern sub-group in a preset manner.
- Before the searching of the SHIFT address table according to the current input content, the method further includes: constructing a SHIFT address table for jumping; hashing one or more pattern chains into the HASH buckets and storing them as chains; and taking each pattern chain in a HASH bucket as input, where if the pattern sub-group contains only one pattern chain, a brute-force algorithm is used to preprocess the pattern chain, and if the number of patterns of the pattern chain in the pattern sub-group is greater than 1, an automaton algorithm is used to preprocess the pattern chain.
- Hashing the one or more pattern chains into the HASH buckets and storing them as chains includes:
- hashing multiple pattern chains into the same HASH bucket; and compiling the multiple pattern chains of the pattern sub-group in the same HASH bucket into an automaton graph.
- If the pattern sub-group is empty, the matching process of the current pattern sub-group is exited.
- Performing the matching search on the pattern sub-group in a preset manner includes: using a brute-force algorithm to match the pattern sub-group against the current input content; if the match succeeds, adding the current matching result to the matching result set; otherwise, exiting the current pattern sub-group matching search.
- Performing the matching search on the pattern sub-group in a preset manner includes: using an automaton algorithm to perform a matching search on the current pattern sub-group; if an output exists in the automaton state, adding all current matching results to the matching result set; if an invalid automaton state is encountered, exiting the current pattern sub-group matching search.
- An apparatus for automaton-based pattern matching, comprising: a first matching module, configured to search a SHIFT address table according to the current input content to obtain a SHIFT address value; a first detecting module, configured to determine whether the obtained SHIFT address value is zero; and a second matching module, configured to: if the SHIFT address value is zero, calculate a HASH value from the prefix of the current input content, use the HASH value as an index to enter the actually matched pattern sub-group, perform a matching search on the pattern sub-group by a brute-force algorithm or an automaton algorithm, and, after the search of the pattern sub-group is completed, shift the current input content forward by one length unit.
- The second detecting module is configured to determine whether the current input content has been completely scanned and, if so, to output the matching result set; otherwise, the first matching module is triggered to search the SHIFT address table according to the current input content.
- The apparatus further includes: a preprocessing module, configured to construct a SHIFT address table for jumping; hash one or more pattern chains into the HASH buckets and store them as chains; and take each pattern chain in a HASH bucket as input, where if the pattern sub-group contains only one pattern chain, the brute-force algorithm is used to preprocess the pattern chain, and if the number of patterns of the pattern chain in the pattern sub-group is greater than 1, the automaton algorithm is used to preprocess the pattern chain.
- The second matching module is configured to use the brute-force algorithm to match the pattern sub-group against the current input content; if the match succeeds, to add the current matching result to the matching result set; otherwise, to exit the current pattern sub-group matching search.
- The second matching module is configured to perform a matching search on the current pattern sub-group using an automaton algorithm; if an output exists in the automaton state, all current matching results are added to the matching result set; if an invalid automaton state is encountered, the current pattern sub-group matching search is exited.
- Through the embodiments of the present invention, the conflict-chain patterns in a HASH bucket are compiled into a graph structure based on the automaton algorithm and the prefix-table query is omitted, so that the optimized algorithm has stable matching performance while efficiency and correctness are preserved. The time complexity of pattern matching within a conflict chain becomes linear, independent of the number of patterns in the conflict chain.
- FIG. 1 shows the data model of the Wu-Manber algorithm;
- FIG. 2 is a flowchart of a method for automaton-based pattern matching in an embodiment of the present invention;
- FIG. 5 shows the scanning process when the sub-group pattern is of the brute-force type in an embodiment of the present invention;
- FIG. 7 is a schematic structural diagram of an apparatus for automaton-based pattern matching in an embodiment of the present invention.
- In the prior art, the WM algorithm uses techniques such as block characters (SHIFT), HASH, and a prefix table (PREFIX) to achieve good matching efficiency.
- SHIFT: block character
- HASH: hash
- PREFIX: prefix table
- FIG. 2 is a flowchart of a method for automaton-based pattern matching in an embodiment of the present invention. The specific steps are as follows:
- Step S201: search the SHIFT address table according to the current input content to obtain a SHIFT address value.
- In this embodiment, the SHIFT address table (offset address table) records SHIFT address values (offset address values), which indicate the offset address of the current input content. As shown in FIG. 3, if the current input content is "ab", the corresponding SHIFT address value found in the SHIFT address table is "0".
- Step S203: determine whether the obtained SHIFT address value is zero; if it is zero, proceed to step S205; otherwise, proceed to step S207.
- Step S205: calculate a HASH value from the prefix of the current input content, use the HASH value as an index to enter the actually matched pattern sub-group, perform a matching search on the pattern sub-group in a preset manner, and then proceed to step S209.
- Specifically, in step S205, the pattern sub-group can be searched for matches by a brute-force algorithm or an automaton algorithm. As shown in FIG. 3, if the current input content is "ab", the HASH value calculated from the prefix of the current input content is "0xab", and the pattern sub-group (abcabe) can be searched by the brute-force algorithm. If the current input content is "de", the HASH value calculated from the prefix of the current input content is "0xde"; because that pattern sub-group (abcde and bcbde) has been compiled into an automaton graph, the automaton algorithm can be used to perform the matching search on it.
- If the pattern sub-group is empty, no match is possible for the content, so no processing is performed and the matching process of the current pattern sub-group is exited directly. As shown in FIG. 3, the pattern sub-group corresponding to the HASH value 0xfffe is empty.
- Step S207: shift the current input content backward by a number of length units equal to the SHIFT address value, and then proceed to the subsequent step S211.
- Step S209: after the scan of the pattern sub-group is completed, shift the current input content forward by one length unit, and then proceed to the subsequent step S211.
- Step S211: determine whether the current input content has been completely scanned; if so, proceed to step S213; otherwise, proceed to step S201.
- MAX{length(abcde),length(bcbde)} denotes the maximum pattern length.
- FIG. 4 shows how the algorithm preprocesses pattern chains in an embodiment of the present invention.
- Step S401: construct a SHIFT address table for jumping.
- Specifically, the SHIFT address table for jumping can be constructed using the existing block-character technique.
- Step S403: hash the pattern chains into the HASH buckets by prefix and store them as chains.
- Optionally, existing HASH techniques may be used to hash the pattern chains into the HASH buckets by prefix and store them as chains.
- HASH bucket (hash bucket): multiple elements may be stored at the same position of a hash table to handle hash collisions. Each position in the hash table thus represents one HASH bucket.
- Step S405: take each pattern chain in a HASH bucket as input. If the conflict chain contains only one pattern, the brute-force algorithm is used to preprocess the pattern; if the number of patterns in the conflict chain is greater than 1, the automaton algorithm is used to preprocess the pattern chain. As shown in FIG. 3, the number of conflict chains in the sub-group pattern corresponding to the HASH value 0xde is two.
- In the embodiments of the present invention, the way the WM algorithm handles pattern conflict chains in a HASH bucket is changed from a simple chain to an automaton-based graph structure, while for a bucket that contains only a single pattern the brute-force algorithm is still used, to save storage space.
- Step S501: use the brute-force algorithm to match the pattern sub-group against the current input content. If the end of the pattern is reached, the match succeeds and step S503 is performed; if a mismatched character is encountered, the match fails, the current scan is terminated, and step S507 is performed.
- Step S505: the match succeeded, so the current matching result is added to the matching result set.
- Step S507: the matching ends, and the current pattern sub-group scan is exited.
- Step S601: perform a matching search on the current pattern sub-group using an automaton algorithm. If an output exists in the automaton state, a pattern has matched and step S603 is performed; if an invalid automaton state is encountered, the match fails, the current scan is terminated, and step S605 is performed.
- Step S603: a pattern has matched, so all current matching results are added to the matching result set.
- Step S605: the matching ends, and the current pattern sub-group matching search is exited.
- FIG. 7 is a schematic structural diagram of an apparatus for automaton-based pattern matching in an embodiment of the present invention.
- The apparatus 700 includes:
- the first matching module 701, configured to search the SHIFT address table according to the current input content to obtain a SHIFT address value;
- the first detecting module 703, configured to determine whether the obtained SHIFT address value is zero;
- the second matching module 705, configured to: if the SHIFT address value is zero, calculate a HASH value from the prefix of the current input content, use the HASH value as an index to enter the actually matched pattern sub-group, perform a matching search on the pattern sub-group in a preset manner, and, after the search of the pattern sub-group is completed, shift the current input content forward by one length unit; and if the SHIFT address value is not zero, shift the current input content backward by a number of length units equal to the SHIFT address value;
- the second detecting module 707, configured to determine whether the current input content has been completely scanned and, if so, to output the matching result set; otherwise, the first matching module is triggered to search the SHIFT address table according to the current input content.
- Optionally, the apparatus further includes:
- a preprocessing module, configured to construct a SHIFT address table for jumping; hash one or more pattern chains into the HASH buckets and store them as chains; and take each pattern chain in a HASH bucket as input, where if the pattern sub-group contains only one pattern chain, the brute-force algorithm is used to preprocess the pattern chain, and if the number of patterns of the pattern chain in the pattern sub-group is greater than 1, the automaton algorithm is used to preprocess the pattern chain.
- The second matching module is further configured to use the brute-force algorithm to match the pattern sub-group against the current input content; if the match succeeds, to add the current matching result to the matching result set; otherwise, to exit the current pattern sub-group matching search.
- The second matching module is configured to perform a matching search on the current pattern sub-group using an automaton algorithm; if an output exists in the automaton state, all current matching results are added to the matching result set; if an invalid automaton state is encountered, the current pattern sub-group matching search is exited.
- The conflict-chain patterns in a HASH bucket are compiled into a graph structure based on the automaton algorithm and the prefix-table query is omitted, so that the optimized algorithm has stable matching performance while efficiency and correctness are preserved. The time complexity of pattern matching within a conflict chain becomes linear, independent of the number of patterns in the conflict chain.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
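The present invention provides a method and apparatus for automaton-based pattern matching. The method includes: searching a SHIFT address table according to the current input content to obtain a SHIFT address value; determining whether the obtained SHIFT address value is zero; if the SHIFT address value is zero, calculating a HASH value from the prefix of the current input content, using the HASH value as an index to enter the actually matched pattern sub-group, performing a matching search on the pattern sub-group by a brute-force algorithm or an automaton algorithm, and, after the search of the pattern sub-group is completed, shifting the current input content forward by one length unit; if the SHIFT address value is not zero, shifting the current input content backward by a number of length units equal to the SHIFT address value; and determining whether the current input content has been completely scanned, and if so, outputting the matching result set; otherwise, returning to the step of searching the SHIFT address table according to the current input content. In this way, the optimized algorithm has stable matching performance while efficiency and correctness are preserved.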
Description
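The present invention relates to the field of computer technologies, and in particular to a method and apparatus for automaton-based pattern matching.
The Wu-Manber algorithm (WM algorithm for short) was proposed by Sun Wu and Udi Manber in the 1990s and evolved from the single-pattern matching algorithm BM. The WM algorithm uses techniques such as block characters (SHIFT), HASH, and a prefix table (PREFIX) to achieve good matching efficiency. FIG. 1 shows the data model of the Wu-Manber algorithm. The data model includes three parts: the SHIFT address table, the HASH-PATTERNS table (prefix list), and the PREFIX table, where the SHIFT address table records offset addresses and the HASH-PATTERNS table records the correspondence between HASH values and pattern sub-groups. In the prior art, the pattern chains abcde and bcbde in a pattern sub-group are hashed into the same HASH bucket, so that in the worst case the matching needs to scan a length of {length(abcde)+length(bcbde)}, and in the best case a length of {the number of patterns, that is, one byte scanned per pattern}; the PREFIX table must also be searched. Here length(abcde) denotes the length of the pattern chain abcde.
However, because of the instability of HASH, HASH conflicts intensify as the number of patterns increases, the matching efficiency degrades to varying degrees, and commercial requirements cannot be met.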
Summary of the Invention
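To solve the above technical problem, the embodiments of the invention provide a method and apparatus for automaton-based pattern matching, so that the optimized algorithm has stable matching performance while efficiency and correctness are preserved.
According to one aspect of the present invention, a method for automaton-based pattern matching is provided. The method includes: searching a SHIFT address table according to the current input content to obtain a SHIFT address value; determining whether the obtained SHIFT address value is zero; if the SHIFT address value is zero, calculating a HASH value from the prefix of the current input content, using the HASH value as an index to enter the actually matched pattern sub-group, performing a matching search on the pattern sub-group in a preset manner, and, after the search of the pattern sub-group is completed, shifting the current input content forward by one length unit; if the SHIFT address value is not zero, shifting the current input content backward by a number of length units equal to the SHIFT address value; and determining whether the current input content has been completely scanned, and if so, outputting the matching result set; otherwise, returning to the step of searching the SHIFT address table according to the current input content.
In an embodiment of the present invention, before the searching of the SHIFT address table according to the current input content, the method further includes: constructing a SHIFT address table for jumping; hashing one or more pattern chains into the HASH buckets and storing them as chains; and taking each pattern chain in a HASH bucket as input, where if the pattern sub-group contains only one pattern chain, a brute-force algorithm is used to preprocess the pattern chain, and if the number of patterns of the pattern chain in the pattern sub-group is greater than 1, an automaton algorithm is used to preprocess the pattern chain.
In an embodiment of the present invention, if multiple pattern chains need to be hashed into the same HASH bucket, hashing the one or more pattern chains into the HASH buckets and storing them as chains includes: hashing the multiple pattern chains into the same HASH bucket; and compiling the multiple pattern chains of the pattern sub-group in the same HASH bucket into an automaton graph.
In an embodiment of the present invention, if the pattern sub-group is empty, the matching process of the current pattern sub-group is exited.
In an embodiment of the present invention, performing the matching search on the pattern sub-group in a preset manner includes: using a brute-force algorithm to match the pattern sub-group against the current input content; if the match succeeds, adding the current matching result to the matching result set; otherwise, exiting the current pattern sub-group matching search.
In an embodiment of the present invention, performing the matching search on the pattern sub-group in a preset manner includes: using an automaton algorithm to perform a matching search on the current pattern sub-group; if an output exists in the automaton state, adding all current matching results to the matching result set; if an invalid automaton state is encountered, exiting the current pattern sub-group matching search.
According to another aspect of the present invention, an apparatus for automaton-based pattern matching is also provided. The apparatus includes: a first matching module, configured to search a SHIFT address table according to the current input content to obtain a SHIFT address value; a first detecting module, configured to determine whether the obtained SHIFT address value is zero; a second matching module, configured to: if the SHIFT address value is zero, calculate a HASH value from the prefix of the current input content, use the HASH value as an index to enter the actually matched pattern sub-group, perform a matching search on the pattern sub-group by a brute-force algorithm or an automaton algorithm, and, after the search of the pattern sub-group is completed, shift the current input content forward by one length unit; and if the SHIFT address value is not zero, shift the current input content backward by a number of length units equal to the SHIFT address value; and a second detecting module, configured to determine whether the current input content has been completely scanned and, if so, to output the matching result set; otherwise, to trigger the first matching module to search the SHIFT address table according to the current input content.
In an embodiment of the present invention, the apparatus further includes: a preprocessing module, configured to construct a SHIFT address table for jumping; hash one or more pattern chains into the HASH buckets and store them as chains; and take each pattern chain in a HASH bucket as input, where if the pattern sub-group contains only one pattern chain, the brute-force algorithm is used to preprocess the pattern chain, and if the number of patterns of the pattern chain in the pattern sub-group is greater than 1, the automaton algorithm is used to preprocess the pattern chain.
In an embodiment of the present invention, the second matching module is configured to use the brute-force algorithm to match the pattern sub-group against the current input content; if the match succeeds, to add the current matching result to the matching result set; otherwise, to exit the current pattern sub-group matching search.
In an embodiment of the present invention, the second matching module is configured to perform a matching search on the current pattern sub-group using an automaton algorithm; if an output exists in the automaton state, all current matching results are added to the matching result set; if an invalid automaton state is encountered, the current pattern sub-group matching search is exited.
Through the embodiments of the present invention, the conflict-chain patterns in a HASH bucket are compiled into a graph structure based on the automaton algorithm and the prefix-table query is omitted, so that the optimized algorithm has stable matching performance while efficiency and correctness are preserved. The time complexity of pattern matching within a conflict chain becomes linear, independent of the number of patterns in the conflict chain.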
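FIG. 1 shows the data model of the Wu-Manber algorithm;
FIG. 2 is a flowchart of the method for automaton-based pattern matching in an embodiment of the present invention;
FIG. 3 shows the improved algorithm data model in an embodiment of the present invention;
FIG. 4 shows how the algorithm preprocesses pattern chains in an embodiment of the present invention;
FIG. 5 shows the scanning process when the sub-group pattern is of the brute-force type in an embodiment of the present invention;
FIG. 6 shows the scanning process when the sub-group pattern is of the automaton type in an embodiment of the present invention; and
FIG. 7 is a schematic structural diagram of the apparatus for automaton-based pattern matching in an embodiment of the present invention.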
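In the prior art, the WM algorithm uses techniques such as block characters (SHIFT), HASH, and a prefix table (PREFIX) to achieve good matching efficiency. However, because of the instability of HASH, HASH conflicts intensify as the number of patterns increases, and the matching efficiency degrades to varying degrees and cannot meet commercial requirements. For example, when the SHIFT address is zero, every node of the conflict chain in the HASH bucket requires one lookup of the PREFIX table, which costs performance. When the number of patterns grows to a certain level, for example on the order of ten thousand, the HASH conflicts among patterns intensify and matching performance drops markedly. For these reasons, the WM algorithm cannot deliver stable matching performance in real commercial scenarios.
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.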
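FIG. 2 is a flowchart of the method for automaton-based pattern matching in an embodiment of the present invention. The specific steps are as follows:
Step S201: search the SHIFT address table according to the current input content to obtain a SHIFT address value.
In this embodiment, the SHIFT address table (offset address table) records SHIFT address values (offset address values), which indicate the offset address of the current input content. As shown in FIG. 3, if the current input content is "ab", the corresponding SHIFT address value found in the SHIFT address table is "0".
Step S203: determine whether the obtained SHIFT address value is zero; if it is zero, proceed to step S205; otherwise, proceed to step S207.
Step S205: calculate a HASH value from the prefix of the current input content, use the HASH value as an index to enter the actually matched pattern sub-group, perform a matching search on the pattern sub-group in a preset manner, and then proceed to step S209.
Specifically, in step S205, the pattern sub-group can be searched for matches by a brute-force algorithm or an automaton algorithm. As shown in FIG. 3, if the current input content is "ab", the HASH value calculated from the prefix of the current input content is "0xab", and the pattern sub-group (abcabe) can be searched by the brute-force algorithm. If the current input content is "de", the HASH value calculated from the prefix of the current input content is "0xde"; because that pattern sub-group (abcde and bcbde) has been compiled into an automaton graph, the automaton algorithm can be used to perform the matching search on it.
In the embodiments of the present invention, if the pattern sub-group is empty, no match is possible for the content, so no processing is performed and the matching process of the current pattern sub-group is exited directly. As shown in FIG. 3, the pattern sub-group corresponding to the HASH value 0xfffe is empty.
Step S207: shift the current input content backward by a number of length units equal to the SHIFT address value, and then proceed to the subsequent step S211.
Step S209: after the scan of the pattern sub-group is completed, shift the current input content forward by one length unit, and then proceed to the subsequent step S211.
Step S211: determine whether the current input content has been completely scanned; if so, proceed to step S213; otherwise, proceed to step S201.
Step S213: output the matching result set.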
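The control flow of steps S201 to S213 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `scan`, the dictionary layout of the tables, the window length `m` (assumed here to be the shortest pattern length, as in the classic Wu-Manber scheme), the two-character block, and the sample SHIFT values are all assumptions made for the example. Both shifts are assumed to move the scan window toward the end of the input, the block string itself stands in for the HASH value, and the sub-group search is reduced to a plain prefix comparison.

```python
def scan(text, shift_table, buckets, m, block=2):
    """Walk the input with the S201-S213 control flow over precomputed tables."""
    default_shift = m - block + 1     # jump used for blocks absent from the table
    results = []
    pos = m                           # index one past the end of the current window
    while pos <= len(text):           # S211: continue until the input is exhausted
        key = text[pos - block:pos]   # current input content (last block of the window)
        shift = shift_table.get(key, default_shift)   # S201
        if shift == 0:                # S203
            start = pos - m
            # S205: enter the sub-group; an empty sub-group yields no candidates.
            for pattern in buckets.get(key, ()):
                if text.startswith(pattern, start):
                    results.append((start, pattern))
            pos += 1                  # S209: move on by one length unit
        else:
            pos += shift              # S207: jump by the SHIFT address value
    return results                    # S213: the matching result set


# Hypothetical tables for the FIG. 3 patterns abcde, bcbde and abcabe (m = 5).
shift_table = {"ab": 0, "de": 0, "bc": 2, "cb": 2, "cd": 1, "bd": 1, "ca": 1}
buckets = {"ab": ["abcabe"], "de": ["abcde", "bcbde"]}
print(scan("xxabcdeyabcabezbcbde", shift_table, buckets, m=5))
```

Each result is a pair of the match offset and the matched pattern; keeping the tables as plain dictionaries is only a convenience for the sketch.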
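FIG. 3 shows the improved algorithm data model in an embodiment of the present invention. Optionally, in this embodiment, the pattern chains (abcde, bcbde) are hashed into the same HASH bucket and then further compiled into an automaton graph, so that in the worst case the matching scans a length of MAX{length(abcde),length(bcbde)}, and in the best case a length of 1. Here MAX{length(abcde),length(bcbde)} denotes the maximum pattern length. It can thus be seen that the improved algorithm is stable when comparing conflict chains: the scan length does not grow as the conflict chain grows.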
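As a worked instance of the two worst-case figures for the bucket holding abcde and bcbde (each five characters long), the prior-art sum quoted in the background and the improved maximum evaluate as below. The symbols $L_{\text{chain}}$ and $L_{\text{automaton}}$ are labels introduced here for the example only.

```latex
\[
\begin{aligned}
L_{\text{chain}} &= \mathrm{length}(abcde) + \mathrm{length}(bcbde) = 5 + 5 = 10,\\
L_{\text{automaton}} &= \max\{\mathrm{length}(abcde),\ \mathrm{length}(bcbde)\} = \max\{5, 5\} = 5.
\end{aligned}
\]
```

Adding a third five-character pattern to the same bucket would raise the first figure to 15 while leaving the second at 5.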
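FIG. 4 shows how the algorithm preprocesses pattern chains in an embodiment of the present invention.
Step S401: construct a SHIFT address table for jumping.
Specifically, the SHIFT address table for jumping can be constructed using the existing block-character technique.
Step S403: hash the pattern chains into the HASH buckets by prefix and store them as chains.
Optionally, existing HASH techniques may be used to hash the pattern chains into the HASH buckets by prefix and store them as chains.
HASH bucket (Hash Bucket): multiple elements may be stored at the same position of a hash table to handle hash collisions. Each position in the hash table thus represents one HASH bucket.
Step S405: take each pattern chain in a HASH bucket as input. If the conflict chain contains only one pattern, the brute-force algorithm is used to preprocess the pattern; if the number of patterns in the conflict chain is greater than 1, the automaton algorithm is used to preprocess the pattern chain. As shown in FIG. 3, the number of conflict chains in the sub-group pattern corresponding to the HASH value 0xde is 2.
In the embodiments of the present invention, the way the WM algorithm handles pattern conflict chains in a HASH bucket is changed from a simple chain to an automaton-based graph structure, while for a bucket that contains only a single pattern the brute-force algorithm is still used, to save storage space.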
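The preprocessing of steps S401 to S405 might look like the sketch below. Several choices are assumptions rather than details given in the text: the names `preprocess` and `compile_automaton`, the two-character block, the window length taken as the shortest pattern length, the bucket key taken as the last block of each pattern's first `m` characters (chosen because it reproduces the 0xab and 0xde buckets of FIG. 3), and the automaton represented as a trie of nested dictionaries.

```python
from collections import defaultdict


def compile_automaton(patterns):
    """Compile patterns into a trie of nested dicts; key None holds the outputs."""
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node.setdefault(None, []).append(pattern)
    return root


def preprocess(patterns, block=2):
    m = min(len(p) for p in patterns)   # window length: the shortest pattern
    default_shift = m - block + 1
    shift_table = {}
    chains = defaultdict(list)
    for p in patterns:
        # S401: every block inside the first m characters limits the safe jump.
        for end in range(block, m + 1):
            key = p[end - block:end]
            shift_table[key] = min(shift_table.get(key, default_shift), m - end)
        # S403: file the pattern under the last block of its first m characters.
        chains[p[m - block:m]].append(p)
    buckets = {}
    for key, chain in chains.items():
        # S405: a single pattern stays brute force; several become an automaton.
        if len(chain) == 1:
            buckets[key] = ("brute", chain[0])
        else:
            buckets[key] = ("automaton", compile_automaton(chain))
    return m, shift_table, buckets


m, shift_table, buckets = preprocess(["abcde", "bcbde", "abcabe"])
print(m, shift_table["ab"], buckets["ab"][0], buckets["de"][0])
```

Storing only the pattern string for single-pattern buckets mirrors the stated motivation of saving storage space where no automaton is needed.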
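FIG. 5 shows the scanning process when the sub-group pattern is of the brute-force type in an embodiment of the present invention. The steps are as follows:
Step S501: use the brute-force algorithm to match the pattern sub-group against the current input content. If the end of the pattern is reached, the match succeeds and step S503 is performed; if a mismatched character is encountered, the match fails, the current scan is terminated, and step S507 is performed.
Step S505: the match succeeded, so the current matching result is added to the matching result set.
Step S507: the matching ends, and the current pattern sub-group scan is exited.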
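A brute-force scan of a single-pattern sub-group can be sketched as a character-by-character comparison. The function name `brute_force_match`, its parameters, and the sample input are illustrative assumptions; the text does not prescribe an interface.

```python
def brute_force_match(text, start, pattern):
    """Compare pattern with text from offset start, one character at a time."""
    for offset, expected in enumerate(pattern):
        pos = start + offset
        if pos >= len(text) or text[pos] != expected:
            return False   # mismatched character: the scan of this sub-group ends
    return True            # end of the pattern reached: the match succeeded


results = []
if brute_force_match("xxabcabe", 2, "abcabe"):
    results.append((2, "abcabe"))   # add the current match to the result set
print(results)
```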
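FIG. 6 shows the scanning process when the sub-group pattern is of the automaton type in an embodiment of the present invention. The specific steps are as follows:
Step S601: perform a matching search on the current pattern sub-group using an automaton algorithm. If an output exists in the automaton state, a pattern has matched and step S603 is performed; if an invalid automaton state is encountered, the match fails, the current scan is terminated, and step S605 is performed.
Step S603: a pattern has matched, so all current matching results are added to the matching result set.
Step S605: the matching ends, and the current pattern sub-group matching search is exited.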
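The automaton scan can be illustrated with a trie walked from the start of the candidate window. Treating the automaton as a plain trie without failure links is an assumption drawn from the rule that an invalid state ends the scan; the names `compile_patterns` and `automaton_match` are hypothetical.

```python
def compile_patterns(patterns):
    """Build a trie of nested dicts; key None marks states that carry outputs."""
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node.setdefault(None, []).append(pattern)
    return root


def automaton_match(text, start, root):
    """Follow transitions from offset start and collect every output reached."""
    matches = []
    node = root
    for ch in text[start:]:
        node = node.get(ch)
        if node is None:                     # invalid state: leave the sub-group scan
            break
        matches.extend(node.get(None, ()))   # state with output: record the matches
    return matches


root = compile_patterns(["abcde", "bcbde"])
print(automaton_match("xxabcdey", 2, root))
```

Because every input character triggers at most one transition, the number of characters examined is bounded by the longest pattern in the bucket, however many patterns the bucket holds.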
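FIG. 7 is a schematic structural diagram of the apparatus for automaton-based pattern matching in an embodiment of the present invention. The apparatus 700 includes:
a first matching module 701, configured to search the SHIFT address table according to the current input content to obtain a SHIFT address value;
a first detecting module 703, configured to determine whether the obtained SHIFT address value is zero;
a second matching module 705, configured to: if the SHIFT address value is zero, calculate a HASH value from the prefix of the current input content, use the HASH value as an index to enter the actually matched pattern sub-group, perform a matching search on the pattern sub-group in a preset manner, and, after the search of the pattern sub-group is completed, shift the current input content forward by one length unit; and if the SHIFT address value is not zero, shift the current input content backward by a number of length units equal to the SHIFT address value;
a second detecting module 707, configured to determine whether the current input content has been completely scanned and, if so, to output the matching result set; otherwise, to trigger the first matching module to search the SHIFT address table according to the current input content.
Optionally, in another embodiment of the present invention, the apparatus further includes:
a preprocessing module, configured to construct a SHIFT address table for jumping; hash one or more pattern chains into the HASH buckets and store them as chains; and take each pattern chain in a HASH bucket as input, where if the pattern sub-group contains only one pattern chain, the brute-force algorithm is used to preprocess the pattern chain, and if the number of patterns of the pattern chain in the pattern sub-group is greater than 1, the automaton algorithm is used to preprocess the pattern chain.
Optionally, in another embodiment of the present invention, the second matching module is further configured to use the brute-force algorithm to match the pattern sub-group against the current input content; if the match succeeds, to add the current matching result to the matching result set; otherwise, to exit the current pattern sub-group matching search.
Optionally, in another embodiment of the present invention, the second matching module is configured to perform a matching search on the current pattern sub-group using an automaton algorithm; if an output exists in the automaton state, all current matching results are added to the matching result set; if an invalid automaton state is encountered, the current pattern sub-group matching search is exited.
The above are preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles described in the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.
As described above, through the above embodiments and preferred implementations, the conflict-chain patterns in a HASH bucket are compiled into a graph structure based on the automaton algorithm and the prefix-table query is omitted, so that the optimized algorithm has stable matching performance while efficiency and correctness are preserved. The time complexity of pattern matching within a conflict chain becomes linear, independent of the number of patterns in the conflict chain.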
Claims (10)
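- A method for automaton-based pattern matching, the method comprising: searching a SHIFT offset address table according to the current input content to obtain a SHIFT offset address value; determining whether the obtained SHIFT address value is zero; if the SHIFT address value is zero, calculating a HASH (hash) value from the prefix of the current input content, using the HASH value as an index to enter the actually matched pattern sub-group, performing a matching search in a preset manner, and, after the search of the pattern sub-group is completed, shifting the current input content forward by one length unit; if the SHIFT address value is not zero, shifting the current input content backward by a number of length units equal to the SHIFT address value; and determining whether the current input content has been completely scanned, and if so, outputting the matching result set; otherwise, returning to the step of searching the SHIFT address table according to the current input content.
- The method according to claim 1, wherein before the searching of the SHIFT address table according to the current input content, the method further comprises: constructing a SHIFT address table for jumping; hashing one or more pattern chains into the HASH buckets and storing them as chains; and taking each pattern chain in a HASH bucket as input, where if the pattern sub-group contains only one pattern chain, a brute-force algorithm is used to preprocess the pattern chain, and if the number of patterns of the pattern chain in the pattern sub-group is greater than 1, an automaton algorithm is used to preprocess the pattern chain.
- The method according to claim 2, wherein hashing the one or more pattern chains into the HASH buckets and storing them as chains comprises: hashing multiple pattern chains into the same HASH bucket; and compiling the multiple pattern chains of the pattern sub-group in the same HASH bucket into an automaton graph.
- The method according to claim 1, wherein if the pattern sub-group is empty, the matching process of the current pattern sub-group is exited.
- The method according to claim 1, wherein performing the matching search on the pattern sub-group in a preset manner comprises: using a brute-force algorithm to match the pattern sub-group against the current input content; if the match succeeds, adding the current matching result to the matching result set; otherwise, exiting the current pattern sub-group matching search.
- The method according to claim 1, wherein performing the matching search in a preset manner comprises: using an automaton algorithm to perform a matching search on the current pattern sub-group; if an output exists in the automaton state, adding all current matching results to the matching result set; if an invalid automaton state is encountered, exiting the current pattern sub-group matching search.
- An apparatus for automaton-based pattern matching, the apparatus comprising: a first matching module, configured to search a SHIFT address table according to the current input content to obtain a SHIFT address value; a first detecting module, configured to determine whether the obtained SHIFT address value is zero; a second matching module, configured to: if the SHIFT address value is zero, calculate a HASH value from the prefix of the current input content, use the HASH value as an index to enter the actually matched pattern sub-group, perform a matching search on the pattern sub-group in a preset manner, and, after the search of the pattern sub-group is completed, shift the current input content forward by one length unit; and if the SHIFT address value is not zero, shift the current input content backward by a number of length units equal to the SHIFT address value; and a second detecting module, configured to determine whether the current input content has been completely scanned and, if so, to output the matching result set; otherwise, to trigger the first matching module to search the SHIFT address table according to the current input content.
- The apparatus according to claim 7, wherein the apparatus further comprises: a preprocessing module, configured to construct a SHIFT address table for jumping; hash one or more pattern chains into the HASH buckets and store them as chains; and take each pattern chain in a HASH bucket as input, where if the pattern sub-group contains only one pattern chain, the brute-force algorithm is used to preprocess the pattern chain, and if the number of patterns of the pattern chain in the pattern sub-group is greater than 1, the automaton algorithm is used to preprocess the pattern chain.
- The apparatus according to claim 7, wherein the second matching module is configured to use the brute-force algorithm to match the pattern sub-group against the current input content; if the match succeeds, to add the current matching result to the matching result set; otherwise, to exit the current pattern sub-group matching search.
- The apparatus according to claim 7, wherein the second matching module is configured to perform a matching search on the current pattern sub-group using an automaton algorithm; if an output exists in the automaton state, all current matching results are added to the matching result set; if an invalid automaton state is encountered, the current pattern sub-group matching search is exited.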
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410681752.9 | 2014-11-24 | ||
CN201410681752.9A CN105701093A (zh) | 2014-11-24 | 2014-11-24 | Method and apparatus for automaton-based pattern matching |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016082503A1 (zh) | 2016-06-02 |
Family
ID=56073511
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2015/080174 WO2016082503A1 (zh) | 2014-11-24 | 2015-05-29 | Method and apparatus for automaton-based pattern matching |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN105701093A (zh) |
WO (1) | WO2016082503A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117668527A (zh) * | 2024-01-31 | 2024-03-08 | 国网湖北省电力有限公司信息通信公司 | Multi-feature recognition method and system under a large-traffic model |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10298606B2 (en) * | 2017-01-06 | 2019-05-21 | Juniper Networks, Inc | Apparatus, system, and method for accelerating security inspections using inline pattern matching |
CN107797940B (zh) * | 2017-11-21 | 2021-02-23 | 四川巧夺天工信息安全智能设备有限公司 | Recovery method for an inaccessible data area of a Toshiba hard disk |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103412858A (zh) * | 2012-07-02 | 2013-11-27 | 清华大学 | Method of large-scale feature matching for text or network content analysis |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
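CN101060411B (zh) * | 2007-05-23 | 2013-04-03 | 西安交大捷普网络科技有限公司 | Multi-pattern matching method capable of improving the detection rate and efficiency of intrusion detection systems |
CN101251845B (zh) * | 2008-03-13 | 2010-06-09 | 苏州爱迪比科技有限公司 | Method for multi-pattern string matching using an improved Wu-Manber algorithm |
CN102609450B (zh) * | 2012-01-10 | 2014-07-23 | 顾乃杰 | Multi-pattern string matching method based on word-length matching |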
2014
- 2014-11-24: CN application CN201410681752.9A (published as CN105701093A), status: withdrawn
2015
- 2015-05-29: WO application PCT/CN2015/080174 (published as WO2016082503A1), status: application filing
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103412858A (zh) * | 2012-07-02 | 2013-11-27 | 清华大学 | Method of large-scale feature matching for text or network content analysis |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
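CN117668527A (zh) * | 2024-01-31 | 2024-03-08 | 国网湖北省电力有限公司信息通信公司 | Multi-feature recognition method and system under a large-traffic model |
CN117668527B (zh) * | 2024-01-31 | 2024-04-26 | 国网湖北省电力有限公司信息通信公司 | Multi-feature recognition method and system under a large-traffic model |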
Also Published As
Publication number | Publication date |
---|---|
CN105701093A (zh) | 2016-06-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 15864277; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 15864277; Country of ref document: EP; Kind code of ref document: A1 |