WO2024114655A1 - Rule expression matching method and apparatus, and computer-readable storage medium - Google Patents

Rule expression matching method and apparatus, and computer-readable storage medium

Info

Publication number
WO2024114655A1
Authority
WO
WIPO (PCT)
Prior art keywords
matching
rule
regular expression
merged
regular
Prior art date
Application number
PCT/CN2023/134854
Other languages
French (fr)
Chinese (zh)
Inventor
李�瑞
Original Assignee
中国银联股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国银联股份有限公司
Publication of WO2024114655A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming

Definitions

  • The present invention belongs to the field of feature matching, and in particular relates to a rule expression matching method, device and computer-readable storage medium.
  • Some matching devices in the prior art use a row-list approach, which cannot express more complex rule expression patterns and offers low scalability and flexibility.
  • Other devices use rule expressions, but to check whether a rule expression satisfies the established syntax they often rely on hard-coded text character parsing or on regular-expression matching (regular-expression matching algorithms are based on finite state machines and cannot handle an unbounded number of elements to be evaluated).
  • the present invention provides the following solutions.
  • a regular expression matching method comprising: receiving a regular text string, performing a syntax check on the regular text string, and outputting a regular expression; based on a simplification algorithm for a cyclic binary code, losslessly converting the regular expression into a simplest regular expression; based on a predicate calculus algorithm, equivalently converting the simplest regular expression into a regular expression matching tree; merging multiple regular expression matching trees into a merged matching network, and identifying common rule fragments; and performing feature matching on data to be matched using the merged matching network and the common rule fragments.
  • the regular expression is losslessly converted into a simplest regular expression based on a simplification algorithm based on a cyclic binary code, and further includes: obtaining all key elements in the regular expression, and generating all combinations of the multiple key elements based on the positive and negative values of each key element; obtaining a combination value range that makes the regular expression true from all combinations, and obtaining a binary code combination of the combination value range; performing a same-bit cyclic binary code merge on multiple binary codes in the binary code combination to obtain a simplified binary code combination; converting each binary bit in the simplified binary code combination back into the key element, and outputting the simplest regular expression.
  • Performing the same-bit cyclic binary code merge on the multiple binary codes comprises: comparing the binary codes in the binary code combination in pairs and merging them to generate new binary codes; comparing the new binary codes, in pairs, with the original binary codes that have not yet been merged and with which they can be combined, merging them to generate further new binary codes, and removing duplicate binary codes; and repeating the above merging steps until no new binary code can be generated by merging.
  • the merging of the same-bit cyclic binary codes further includes: when there is only one different binary bit in the two binary codes, setting the different binary bit as a set symbol, and keeping the other identical binary bits unchanged as a new binary code.
  • each binary bit in the simplified binary code combination is converted back to the key element, including: for each binary code in the simplified binary code combination, converting it into a corresponding key element according to the position of the binary bit; performing a negation operation or no negation operation on the key element according to the value of each binary bit; and if the binary code includes a binary bit whose value is the set symbol, ignoring the corresponding key element.
  • performing grammatical verification on the regular text string further includes: performing grammatical verification on the regular text string for completeness using a context-free grammar and a recursive descent algorithm.
  • the grammar check of the regular text string also includes: reading the regular text string, dividing the regular text string according to predetermined delimiters to obtain multiple morphemes; sorting each morpheme according to the order of morphemes in the regular text string to generate a lexical unit sequence; traversing the lexical unit sequence to check the grammar of the regular text string.
  • the morphemes are divided into a key element type and a logical operation type.
  • the simplest regular expression is equivalently converted into a regular expression matching tree; it also includes: repeatedly executing one or more of the following predicate deduction algorithms until stable, to obtain the regular expression matching tree: obtaining a rule tree corresponding to the simplest regular expression; if there are multiple non-operation child nodes of the rule tree, the non-operation is pushed down to the child node, and the operator and or operator are interchanged; if the current operator of the simplest regular expression is consistent with the parent node operator, the child node of the current operator is moved up, and the current operator is deleted; for the leaf nodes at the same level, they are sorted according to the unique attributes of the nodes.
  • a plurality of the rule expression matching trees are merged into a merged matching network, and common rule fragments are identified, including: selecting a rule expression matching tree for up-down transposition, identifying the rule expression matching tree square rule as the root node as the initial state of the merged matching network; traversing other rule expression matching trees one by one for up-down transposition, and integrating them into the merged matching network one by one; after the traversal is completed, a complete merged matching network is formed, and the common rule fragments are extracted.
  • the integration into the merged matching network one by one also includes: for the element nodes in a single rule expression matching tree, adding or reusing the element nodes in the merged matching network; and/or, for the logic symbol nodes in a single rule expression matching tree, adding or reusing the logic symbol nodes in the merged matching network; and/or, for completely overlapping logic symbol nodes, the common rule fragment and its corresponding rule expression matching tree can be extracted through reverse search; and/or, for partially overlapping logic symbol nodes, the logic symbol nodes in the merged matching network are split.
  • the merged matching network and the common rule fragments are used to perform feature matching on the data to be matched, including: pre-prioritizing the various rule identifiers in the merged matching network; matching the element set involved in each rule identifier in the merged matching network with the data to be matched in sequence according to the priority until the match is successful or the match is completed.
  • the merged matching network and the common rule fragments are used to perform feature matching on the data to be matched, including one or more of the following operations: matching the data to be matched with a set of element nodes of the rule expression matching tree involved in each rule identifier, entering from the entrance of the merged matching network, and caching the element matching results if an element node of the rule expression matching tree is matched.
  • the merged matching network and the common rule fragment are used to perform feature matching on the data to be matched, and the method also includes: if a logical node of the rule expression matching tree is matched, querying in the cache whether the parent element node of the logical node has been hit, wherein: if there is no cached result, taking the next element node from the element node set to match the data to be matched; if there is a cached result, directly taking the cached result for logical operation; and, if the logical node belongs to a common rule fragment, caching the logical matching result.
  • using the merged matching network and the common rule fragment to perform feature matching on the to-be-matched data further includes: if a rule identification node of the rule expression matching tree is matched, returning a hit rule identification.
  • a regular expression matching device which is configured to execute any method as described in the first aspect, and the device includes: a grammar checker, which is used to receive a regular text string, perform grammar check on the regular text string, and output a regular expression; a feature converter, which is used to losslessly convert the regular expression into a simplest regular expression based on a simplification algorithm of a cyclic binary code; a predicate calculator, which is used to convert the simplest regular expression into a regular expression matching tree based on a predicate calculation algorithm; a network merger, which is used to merge multiple regular expression matching trees into a merged matching network and identify common rule fragments; and a feature matcher, which is used to perform feature matching on data to be matched using the merged matching network and the common rule fragments.
  • a regular expression matching device comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute: the method according to the first aspect.
  • a computer-readable storage medium stores a program, and when the program is executed by a multi-core processor, the multi-core processor executes the method of the first aspect.
  • One of the advantages of the above implementation is that it can significantly improve the matching efficiency.
  • FIG1 is a schematic diagram of the structure of a regular expression matching device according to an embodiment of the present invention.
  • FIG2 is a schematic diagram of a flow chart of a regular expression matching method according to an embodiment of the present invention.
  • FIG3 is a schematic diagram of a rule tree of a regular expression according to an embodiment of the present invention.
  • FIG4 is a schematic diagram of rule tree conversion according to an embodiment of the present invention.
  • FIG5 is a schematic diagram of rule tree conversion according to an embodiment of the present invention.
  • FIG6 is a schematic diagram of a rule tree inversion according to an embodiment of the present invention.
  • FIG7 is a schematic diagram of rule tree merging according to an embodiment of the present invention.
  • FIG8 is a schematic diagram of rule tree merging according to another embodiment of the present invention.
  • FIG9 is a schematic diagram of a rule tree according to an embodiment of the present invention.
  • FIG10 is a schematic diagram of rule tree merging according to an embodiment of the present invention.
  • A/B can mean A or B.
  • The term “and/or” in this document merely describes an association relationship between associated objects, indicating that three relationships can exist.
  • For example, A and/or B can mean: A exists alone, A and B exist at the same time, or B exists alone.
  • The terms “first”, “second”, etc. are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, a feature defined as “first”, “second”, etc. may explicitly or implicitly include one or more of the features. In the description of the embodiments of the present application, unless otherwise specified, "plurality" means two or more.
  • FIG1 shows an exemplary regular expression matching device, which includes: a grammar checker 110, which is used to receive a regular text string, perform grammar check on the regular text string, output a regular expression, and ensure the rationality of the matching regular expression; a feature converter 120, which is used to losslessly convert the regular expression into a simplest regular expression based on a simplification algorithm of a cyclic binary code; a predicate calculator 130, which is used to convert the simplest regular expression into a regular expression matching tree based on a predicate calculation algorithm, so as to prepare for the subsequent identification of common fragments; a network merger 140, which is used to merge multiple regular expression matching trees into a merged matching network and identify common rule fragments; a feature matcher 150, which is used to perform feature matching on the data to be matched using the merged matching network and the common rule fragments. In this way, through a series of simplification steps, the matching efficiency can be significantly improved.
  • FIG. 2 shows a flow chart of a method for performing regular expression matching according to an embodiment of the present disclosure. It should be understood that the method 200 may also include additional blocks not shown and/or may omit the blocks shown, and the scope of the present disclosure is not limited in this respect.
  • Step 210 receiving a rule text string, performing syntax check on the rule text string, and outputting a rule expression
  • the regular text string is checked for grammatical completeness using a context-free grammar and a recursive descent algorithm.
  • the following steps may be specifically performed: read in the regular text string, split the regular text string according to predetermined delimiters to obtain a plurality of morphemes, wherein morphemes may be divided into key element types and logical operation types; sort each morpheme according to the order of morphemes in the regular text string to generate a lexical unit sequence; traverse the lexical unit sequence to verify the grammar of the regular text string.
  • Each potentially applicable rule can be configured as a corresponding text string, that is, a rule text string.
  • For example, suppose there are currently the following two rule text strings:
  • The electronic device can receive the rule text strings configured by a developer or other user. For each received rule text string, it can split the string at the agreed delimiter, remove extra spaces and line breaks, and obtain the individual morphemes. For example, splitting at the comma "," yields each morpheme; the lexical unit attribute table is then queried to obtain the attribute information of each morpheme:
  • Morphemes are divided into key element types and logical operation types.
  • Logical operation types are mainly composed of &,
  • Each morpheme can be sorted according to its order in the regular text string. For the convenience of description, the sorted morphemes are called lexical unit sequences.
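  • The splitting step described above can be sketched as a small tokenizer. This is a minimal illustration, not the patent's implementation: the surface syntax (identifiers as key elements, `&`, `|`, `!` and parentheses as logical operation morphemes) and the names `TOKEN_RE` and `tokenize` are assumptions, since the example rule text strings and the lexical unit attribute table are not reproduced in this text.

```python
import re
from typing import List, Tuple

# Assumed syntax (hypothetical): key elements are identifiers; logical
# operation morphemes are & | ! and the two parentheses.
TOKEN_RE = re.compile(r"\s*(?:(?P<op>[&|!()])|(?P<elem>[A-Za-z_][A-Za-z0-9_]*))")


def tokenize(rule_text: str) -> List[Tuple[str, str]]:
    """Split a rule text string into an ordered lexical unit sequence.

    Each lexical unit is (type, text), where type is "ELEMENT" for a key
    element morpheme or "LOGIC" for a logical operation morpheme.
    """
    text = rule_text.strip()
    tokens: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}: {text[pos]!r}")
        if match.group("op"):
            tokens.append(("LOGIC", match.group("op")))
        else:
            tokens.append(("ELEMENT", match.group("elem")))
        pos = match.end()
    return tokens
```

  • The tokens are appended in the order in which they occur, so the returned list already plays the role of the lexical unit sequence; no separate sorting pass is needed in this sketch.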
  • A production model provided by some embodiments is shown below. Based on this production model, which is created from a context-free grammar conforming to extended Backus-Naur Form (EBNF), grammatical analysis can be performed on a rule text string and a rule tree corresponding to the rule text string can be generated.
  • Each morpheme in the lexical unit sequence corresponding to the rule text string can be read in order using a top-down recursive descent algorithm. One symbol is looked ahead at each step, and the look-ahead symbol guides the selection of the grammar rule (analysis function) applicable to each morpheme.
  • There are five types of analysis functions: the expression analysis function expr(), the OR analysis function or(), the AND-left analysis function andLeftCond(), the AND-right analysis function andRightCond(), and the NOT analysis function notCond().
  • Each analysis function can be processed according to a conventional recursive descent algorithm, and then the parsed lexical unit sequence is traversed once, the grammar of the regular expression is verified, and a rule tree corresponding to the regular text string is generated.
  • the process of establishing a rule tree corresponding to the rule text string may include:
  • Each key element contained in the rule text string is respectively included as a node in the rule tree corresponding to the rule text string, and the logical operators in the rule text string are also respectively included as a node in the rule tree.
  • the nodes that have an associated relationship in the rule text string can be connected to establish the rule tree corresponding to the rule text string.
  • When establishing the rule tree corresponding to a rule text string, if the analysis function applicable to a morpheme in the lexical unit sequence is the OR analysis function or(), the AND-left analysis function andLeftCond(), the NOT analysis function notCond(), or the like, a new intermediate matching node such as OR (or) or AND (and) can be added to the rule tree at the same time. If the morpheme (element) in the lexical unit sequence is a terminal symbol (TOK_COND), a corresponding leaf node can be added below the associated intermediate node. After all lexical units have been traversed, if the grammar is satisfied, the rule tree shown in Figure 3 has been established.
  • a complete grammar verification method suitable for regular expression logical operations is created, which uses a recursive descent algorithm to complete the grammar verification of regular expressions only once, and can perform grammar checking on unlimited logical operation combinations.
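  • The single-pass verification described above can be sketched with a small recursive descent parser that both checks the grammar and builds the rule tree. The grammar used here is an assumption (the source's EBNF production model is not reproduced in this text): `or := and ('|' and)*`, `and := not ('&' not)*`, `not := '!' not | '(' or ')' | ELEMENT`. The class and method names are hypothetical and only loosely mirror the source's expr(), or(), andLeftCond(), andRightCond() and notCond().

```python
from typing import List, Tuple

Token = Tuple[str, str]  # ("ELEMENT", name) or ("LOGIC", symbol)
EOF: Token = ("EOF", "")


class RuleParser:
    """Recursive descent parser with one token of look-ahead."""

    def __init__(self, tokens: List[Token]):
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> Token:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else EOF

    def take(self) -> Token:
        token = self.peek()
        self.pos += 1
        return token

    def parse(self):
        """Verify the whole sequence and return the rule tree."""
        tree = self.or_expr()
        if self.peek() != EOF:
            raise SyntaxError(f"unexpected token {self.peek()[1]!r}")
        return tree

    def or_expr(self):
        children = [self.and_expr()]
        while self.peek() == ("LOGIC", "|"):
            self.take()
            children.append(self.and_expr())
        return children[0] if len(children) == 1 else ("or", children)

    def and_expr(self):
        children = [self.not_expr()]
        while self.peek() == ("LOGIC", "&"):
            self.take()
            children.append(self.not_expr())
        return children[0] if len(children) == 1 else ("and", children)

    def not_expr(self):
        kind, value = self.take()
        if (kind, value) == ("LOGIC", "!"):
            return ("not", [self.not_expr()])
        if (kind, value) == ("LOGIC", "("):
            inner = self.or_expr()
            if self.take() != ("LOGIC", ")"):
                raise SyntaxError("missing closing parenthesis")
            return inner
        if kind == "ELEMENT":
            return ("elem", value)  # terminal symbol becomes a leaf node
        raise SyntaxError(f"unexpected token {value!r}")
```

  • Because the grammar is recursive, nesting depth and the number of operands are not bounded in advance, which is the property the text contrasts with finite-state regular-expression matching. In this sketch a tree node is a `(kind, payload)` tuple: `("elem", name)` for a leaf and `(operator, [children])` for an intermediate node.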
  • Step 220 based on the simplification algorithm of cyclic binary code, losslessly convert the regular expression into a simplest regular expression
  • the above step 220 may further include:
  • Step 221 obtaining all key elements in the regular expression, and generating all combinations of the multiple key elements based on the positive and negative values of each key element;
  • For example, the key elements are A, B, C, and D.
  • Each key element takes the value 0 or 1: if the key element itself is represented by 1, its negation is represented by 0.
  • Step 222 obtaining a combination value range that makes the regular expression true from all combinations, and obtaining a binary code combination of the combination value range;
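  • Steps 221 and 222 amount to enumerating a truth table and keeping the rows for which the rule expression is true. The sketch below assumes a rule tree made of `(kind, payload)` tuples, with `("elem", name)` leaves and `("not" | "and" | "or", [children])` intermediate nodes; that representation and the function names are illustrative assumptions, not taken from the source.

```python
from itertools import product


def key_elements(node):
    """Collect the names of all key elements that occur in a rule tree."""
    kind, payload = node
    if kind == "elem":
        return {payload}
    return set().union(*(key_elements(child) for child in payload))


def evaluate(node, env):
    """Evaluate a rule tree under an assignment {element name: bool}."""
    kind, payload = node
    if kind == "elem":
        return env[payload]
    if kind == "not":
        return not evaluate(payload[0], env)
    if kind == "and":
        return all(evaluate(child, env) for child in payload)
    if kind == "or":
        return any(evaluate(child, env) for child in payload)
    raise ValueError(f"unknown node kind: {kind}")


def true_codes(node):
    """Return (element names, binary codes that make the expression true).

    Bit i of each code is the value of names[i]: 1 for the element itself,
    0 for its negation.
    """
    names = sorted(key_elements(node))
    codes = []
    for bits in product((0, 1), repeat=len(names)):
        env = dict(zip(names, (bool(bit) for bit in bits)))
        if evaluate(node, env):
            codes.append("".join(str(bit) for bit in bits))
    return names, codes
```

  • The enumeration visits 2^n combinations for n key elements, so this direct form is only practical when a single rule involves a modest number of key elements.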
  • Step 223 performing co-position cyclic binary code merging on multiple binary codes in the binary code combination to obtain a simplified binary code combination
  • the above step 223 may specifically include:
  • Step 224 convert each binary bit in the simplified binary code combination back to the key element, and output the simplest regular expression.
  • the above step 224 may specifically include one or more of the following operations:
  • the merging process is as follows:
  • The above simplification algorithm based on cyclic binary code is used to simplify the rule expression configured by the user: redundant rule expression fragments are automatically removed or simplified, the expression is losslessly converted into the simplest rule expression, and a matching tree is generated from the simplified rule expression.
  • In this way, the text rules configured by the user are losslessly converted into the simplest rule expressions, which greatly simplifies the subsequent matching process and improves matching efficiency.
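  • The merging of step 223 and the back-conversion of step 224 can be sketched as follows. Two codes are merged when they differ in exactly one bit, which is replaced by a set symbol; here the set symbol is assumed to be "-". The round-based loop resembles the prime-implicant stage of the Quine-McCluskey method and is a stand-in for the source's procedure, whose detailed steps are not reproduced in this text; it does not attempt to select a minimal cover. All names are hypothetical.

```python
from itertools import combinations

SET_SYMBOL = "-"  # assumed marker for a bit that may be either 0 or 1


def merge_pair(a: str, b: str):
    """Merge two codes that differ in exactly one binary bit, else None."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) != 1:
        return None
    i = diff[0]
    if SET_SYMBOL in (a[i], b[i]):
        return None  # set symbols must sit in the same positions
    return a[:i] + SET_SYMBOL + a[i + 1:]


def simplify(codes):
    """Repeat pairwise merging until no new code can be generated."""
    current, result = set(codes), set()
    while current:
        merged, used = set(), set()  # sets also remove duplicate codes
        for a, b in combinations(sorted(current), 2):
            new_code = merge_pair(a, b)
            if new_code is not None:
                merged.add(new_code)
                used.update((a, b))
        result |= current - used  # codes that could not be merged further
        current = merged
    return sorted(result)


def to_expression(names, codes) -> str:
    """Convert codes back to key elements: 1 keeps, 0 negates, '-' ignores."""
    terms = []
    for code in codes:
        literals = [name if bit == "1" else "!" + name
                    for name, bit in zip(names, code) if bit != SET_SYMBOL]
        terms.append(" & ".join(literals) if literals else "true")
    return " | ".join(terms)
```

  • For instance, the codes "10" and "11" over the elements A and B merge into "1-", which converts back to the single key element A: the redundant test on B disappears.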
  • Step 230 converting the simplest regular expression into a regular expression matching tree based on a predicate calculus algorithm
  • the above step 230 may specifically include: first, obtaining a rule tree corresponding to the simplest regular expression, and then repeatedly executing the following one or more predicate deduction algorithms until they are stable, thereby obtaining the regular expression matching tree:
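  • If a NOT operation in the rule tree has multiple child nodes, the NOT operation is pushed down to the child nodes, and the AND operator “&” and the OR operator are interchanged.
  • If the current operator is the same as the operator of its parent node, the child nodes of the current operator are moved up and the current operator is deleted.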
  • leaf nodes in the same layer are sorted by unique attributes of the nodes, such as ascending or descending order of field ID, and non-leaf nodes in the same layer are sorted at the back.
  • the unique attributes of non-leaf nodes are composed of the unique attributes of their child nodes, so non-leaf nodes in the same layer are also sorted by unique attributes.
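  • The three deduction rules above (push negation down while swapping AND and OR, hoist children of an operator that repeats its parent, sort nodes of the same layer) can be sketched on a tuple-based rule tree. The `(kind, payload)` representation, the use of the element name as the "unique attribute", and the string signature used to order non-leaf nodes are all assumptions made for illustration.

```python
def push_not(node, negate=False):
    """Push negations down to the elements (De Morgan), swapping and/or."""
    kind, payload = node
    if kind == "elem":
        return ("not", [node]) if negate else node
    if kind == "not":
        return push_not(payload[0], not negate)
    if negate:
        kind = "or" if kind == "and" else "and"
    return (kind, [push_not(child, negate) for child in payload])


def flatten(node):
    """Move children up when a child operator equals its parent operator."""
    kind, payload = node
    if kind == "elem":
        return node
    children = []
    for child in (flatten(c) for c in payload):
        if kind in ("and", "or") and child[0] == kind:
            children.extend(child[1])  # hoist grandchildren, drop the operator
        else:
            children.append(child)
    return (kind, children)


def signature(node) -> str:
    """Unique attribute of a node, composed from those of its children."""
    kind, payload = node
    if kind == "elem":
        return payload
    return kind + "(" + ",".join(signature(child) for child in payload) + ")"


def sort_children(node):
    """Sort each layer: plain elements first, other nodes after them."""
    kind, payload = node
    if kind == "elem":
        return node
    children = [sort_children(child) for child in payload]
    children.sort(key=lambda child: (child[0] != "elem", signature(child)))
    return (kind, children)


def normalize(node):
    """Apply the three rules; one pass in this order is already stable."""
    return sort_children(flatten(push_not(node)))
```

  • After normalization, two expressions that differ only in operand order or in redundant nesting produce identical trees, which is what makes the later search for common fragments a matter of comparing subtrees.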
  • Step 240 merging multiple rule expression matching trees into a merged matching network, and identifying common rule fragments
  • the above step 240 may further include:
  • Integrating the trees into the merged matching network one by one also includes: for the element nodes in a single rule expression matching tree, adding or reusing the element nodes in the merged matching network; and/or, for the logic symbol nodes in a single rule expression matching tree, adding or reusing the logic symbol nodes in the merged matching network; and/or, for completely overlapping logic symbol nodes, extracting the common rule fragment and its corresponding rule expression matching trees through reverse search (for example, see Figure 7, where "Condition 5 & Condition 6" is a common fragment shared by Rule 1 and Rule 2); and/or, for partially overlapping logic symbol nodes, splitting the logic symbol nodes in the merged matching network (for example, see Figure 8).
  • The rule trees shown in FIG9 are merged to generate the merged matching network shown in FIG10, in which the elements in the dashed box are common rule fragments, and the element set involved in each rule identifier (rule 1 and rule 2) is recorded at the same time.
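  • One way to approximate the add-or-reuse behaviour described above is to key every subtree by a canonical signature, so that identical element nodes and identical logic symbol nodes are stored once and shared. The sketch below is a simplified stand-in, not the patent's network: it does not invert the trees or split partially overlapping logic symbol nodes, and the class `MergedNetwork`, its attributes and the condition names are hypothetical. It assumes the trees are already normalized `(kind, payload)` tuples.

```python
from collections import defaultdict


def signature(node) -> str:
    """Canonical text form of a subtree, used as the node's identity."""
    kind, payload = node
    if kind == "elem":
        return payload
    return kind + "(" + ",".join(signature(child) for child in payload) + ")"


class MergedNetwork:
    def __init__(self):
        self.nodes = {}                      # signature -> (kind, child signatures)
        self.rules_using = defaultdict(set)  # signature -> rule identifiers
        self.roots = {}                      # rule identifier -> root signature
        self.elements = defaultdict(set)     # rule identifier -> element names

    def add_rule(self, rule_id, tree):
        """Integrate one rule expression matching tree into the network."""
        self.roots[rule_id] = self._add(rule_id, tree)

    def _add(self, rule_id, node):
        kind, payload = node
        sig = signature(node)
        if kind == "elem":
            self.nodes.setdefault(sig, ("elem", ()))  # add or reuse element node
            self.elements[rule_id].add(payload)
        else:
            child_sigs = tuple(self._add(rule_id, child) for child in payload)
            self.nodes.setdefault(sig, (kind, child_sigs))  # add or reuse logic node
        self.rules_using[sig].add(rule_id)
        return sig

    def common_fragments(self):
        """Logic symbol nodes that are shared by more than one rule."""
        return {sig: sorted(rules) for sig, rules in self.rules_using.items()
                if self.nodes[sig][0] != "elem" and len(rules) > 1}
```

  • The `rules_using` map plays the role of the reverse search: given a shared logic node, it returns the rules whose matching trees contain it. The `elements` map records the element set involved in each rule identifier.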
  • Step 250 perform feature matching on the to-be-matched data using the merged matching network and the common rule fragments.
  • each rule identifier in the merged matching network may be prioritized in advance; and the element set involved in each rule identifier in the merged matching network is matched with the data to be matched in sequence according to the priority until the match succeeds or ends.
  • using the merged matching network and the common rule fragment to perform feature matching on the to-be-matched data includes:
  • Element matching: according to the element node set of the rule expression matching tree involved in each rule identifier, enter from the entrance of the merged matching network, match the element set against the data to be matched, and cache the element matching results.
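  • The prioritized matching with cached results can be sketched as follows. Rules are tried in priority order, each element is tested against the data at most once, and results for logic nodes are reused when a later rule contains the same fragment. This sketch caches every logic node rather than only those belonging to common rule fragments, which is a simplification; the function name, the predicate table and the field names in the usage note are hypothetical.

```python
def signature(node) -> str:
    """Canonical text form of a subtree, used as the cache key."""
    kind, payload = node
    if kind == "elem":
        return payload
    return kind + "(" + ",".join(signature(child) for child in payload) + ")"


def match_rules(rules, predicates, data):
    """Return the identifier of the first rule that matches, or None.

    rules:      list of (rule_id, tree), already sorted by priority
    predicates: {element name: function(data) -> bool}
    data:       the data to be matched
    """
    cache = {}  # signature -> bool, shared by all rules for this data item

    def evaluate(node) -> bool:
        sig = signature(node)
        if sig in cache:
            return cache[sig]  # reuse a cached element or fragment result
        kind, payload = node
        if kind == "elem":
            result = bool(predicates[payload](data))
        elif kind == "not":
            result = not evaluate(payload[0])
        elif kind == "and":
            result = all(evaluate(child) for child in payload)
        elif kind == "or":
            result = any(evaluate(child) for child in payload)
        else:
            raise ValueError(f"unknown node kind: {kind}")
        cache[sig] = result
        return result

    for rule_id, tree in rules:
        if evaluate(tree):
            return rule_id  # hit: return the rule identifier
    return None
```

  • As a usage illustration, if rule 1 is `(amount_high | foreign) & night` and rule 2 is `amount_high | foreign`, the shared OR fragment is evaluated while rule 1 is tried and its cached result is reused for rule 2, so neither element predicate runs a second time.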
  • the device in the implementation mode of the present application can implement each process of the implementation mode of the aforementioned method and achieve the same effects and functions, which will not be repeated here.
  • An embodiment also provides a non-volatile computer storage medium for the rule expression matching method, on which computer-executable instructions are stored; the instructions are configured to execute the method described in the above embodiments when executed by a processor.
  • the apparatus, equipment and computer-readable storage medium provided in the embodiments of the present application correspond one-to-one to the method. Therefore, the apparatus, equipment and computer-readable storage medium also have similar beneficial technical effects as the corresponding methods. Since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the apparatus, equipment and computer-readable storage medium will not be repeated here.
  • the embodiments of the present invention may be provided as methods, devices (equipment or system), or computer-readable storage media. Therefore, the present invention may be implemented in the form of a complete hardware implementation, a complete software implementation, or an implementation combining software and hardware. Moreover, the present invention may be implemented in the form of a computer-readable storage medium implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • each process and/or box in the flowchart and/or block diagram, and the combination of the process and/or box in the flowchart and/or block diagram can be implemented by computer program instructions.
  • These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the function specified in one process or multiple processes in the flowchart and/or one box or multiple boxes in the block diagram.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product including an instruction device that implements the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device so that a series of operational steps are executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.
  • a computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.
  • Memory may include non-permanent storage in a computer-readable medium, in the form of random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
  • Computer-readable media include permanent and non-permanent, removable and non-removable media that can implement information storage by any method or technology.
  • Information can be computer-readable instructions, data structures, program modules or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

Provided in the present invention are a rule expression matching method and apparatus, and a computer-readable storage medium. The method comprises: receiving a rule text string, performing syntax validation on the rule text string, and outputting a rule expression (210); on the basis of a reduction algorithm for cyclic binary codes, converting the rule expression into the simplest rule expression in a lossless manner (220); on the basis of a predicate calculus algorithm, equivalently converting the simplest rule expression into a rule expression matching tree (230); merging a plurality of rule expression matching trees into a merged matching network, and identifying a common rule fragment (240); and using the merged matching network and the common rule fragment to perform feature matching on data to be subjected to matching (250). By using the method, the matching efficiency can be increased.

Description

A rule expression matching method, apparatus, and computer-readable storage medium
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 202211515709.6, filed with the China Patent Office on November 29, 2022 and entitled “A rule expression matching method, apparatus, and computer-readable storage medium”, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention belongs to the field of feature matching, and in particular relates to a rule expression matching method, apparatus, and computer-readable storage medium.
Background
This section is intended to provide background or context for the embodiments of the invention recited in the claims. Nothing described here is admitted to be prior art merely because it is included in this section.
Some matching apparatuses in the prior art use a row-list approach, which cannot express more complex rule expression patterns and offers little extensibility or flexibility. Others do use rule expressions, but to check whether a rule expression conforms to the established syntax they typically rely on hard-coded parsing of the text characters or on regular-expression matching (regular-expression matching algorithms are based on finite state machines and cannot operate on an unbounded number of elements to be evaluated). These checks suit only simple scenarios and cannot verify every combination of rule expression patterns for completeness. In addition, for rule expressions configured by the business, there is no effective equivalent predicate operation that reduces a user-configured rule expression to its simplest form, and no identification of fragments shared among rule expressions; both shortcomings hurt subsequent matching efficiency.
Therefore, how to improve matching efficiency is a problem that urgently needs to be solved.
Summary of the Invention
In view of the above problems in the prior art, a rule expression matching method, apparatus, and computer-readable storage medium are proposed, with which the above problems can be solved.
The present invention provides the following solutions.
In a first aspect, a rule expression matching method is provided, comprising: receiving a rule text string, performing a syntax check on the rule text string, and outputting a rule expression; losslessly converting the rule expression into a simplest rule expression based on a cyclic binary code simplification algorithm; equivalently converting the simplest rule expression into a rule expression matching tree based on a predicate calculus algorithm; merging multiple rule expression matching trees into a merged matching network and identifying common rule fragments; and performing feature matching on data to be matched using the merged matching network and the common rule fragments.
In one embodiment, losslessly converting the rule expression into the simplest rule expression based on the cyclic binary code simplification algorithm further includes: obtaining all key elements in the rule expression, and generating all combinations of the key elements from the positive and negated values of each key element; obtaining, from all the combinations, the range of combination values that make the rule expression true, and obtaining the binary code combination for that range; performing same-position cyclic binary code merging on the binary codes in the binary code combination to obtain a simplified binary code combination; and converting each binary bit in the simplified binary code combination back into its key element and outputting the simplest rule expression.
In one embodiment, performing same-position cyclic binary code merging on the binary codes includes: comparing the binary codes in the binary code combination pairwise and merging them to generate new binary codes; comparing the new binary codes pairwise with the original binary codes that could not be merged, merging them to generate new binary codes, and removing duplicate binary codes; and repeating the above merging steps until no new binary code can be generated by merging.
In one embodiment, the same-position cyclic binary code merging further includes: when two binary codes differ in exactly one binary bit, setting that bit to a designated symbol and keeping the remaining identical bits unchanged, to form a new binary code.
In one embodiment, converting each binary bit in the simplified binary code combination back into its key element includes: for each binary code in the simplified binary code combination, converting each binary bit into the corresponding key element according to the position of the bit; negating or not negating the key element according to the value of the bit; and, if the binary code contains a bit whose value is the designated symbol, ignoring the corresponding key element.
In one embodiment, performing the syntax check on the rule text string further includes: performing a completeness syntax check on the rule text string using a context-free grammar and a recursive descent algorithm.
In one embodiment, performing the syntax check on the rule text string further includes: reading in the rule text string and splitting it at predetermined delimiters to obtain multiple lexemes; ordering the lexemes according to their order in the rule text string to generate a lexical unit sequence; and traversing the lexical unit sequence to check the syntax of the rule text string.
In one embodiment, the lexemes are divided into a key element type and a logical operation type.
In one embodiment, equivalently converting the simplest rule expression into the rule expression matching tree further includes repeatedly performing one or more of the following predicate deduction operations until the result is stable, to obtain the rule expression matching tree: obtaining the rule tree corresponding to the simplest rule expression; if a NOT operation in the rule tree has multiple child nodes, pushing the NOT operation down into the child nodes and interchanging the AND and OR operators; if the current operator is the same as the operator of its parent node, moving the child nodes of the current operator up and deleting the current operator; and sorting the leaf nodes on the same level by a unique node attribute.
In one embodiment, merging the multiple rule expression matching trees into the merged matching network and identifying the common rule fragments includes: selecting one rule expression matching tree and inverting it top to bottom, with the rule identifier of that rule expression matching tree as the root node, as the initial state of the merged matching network; traversing the other rule expression matching trees one by one, inverting each top to bottom and fusing it into the merged matching network; and, after the traversal is complete, forming the complete merged matching network and extracting the common rule fragments.
In one embodiment, fusing the trees into the merged matching network one by one further includes: for an element node in a single rule expression matching tree, adding a new element node to the merged matching network or reusing an existing one; and/or, for a logical operator node in a single rule expression matching tree, adding a new logical operator node to the merged matching network or reusing an existing one; and/or, for logical operator nodes that coincide completely, extracting the common rule fragment and the rule expression matching trees it belongs to by reverse search; and/or, for logical operator nodes that coincide partially, splitting the logical operator node in the merged matching network.
In one embodiment, performing feature matching on the data to be matched using the merged matching network and the common rule fragments includes: ranking the rule identifiers in the merged matching network by priority in advance; and, in priority order, matching the element set involved in each rule identifier in the merged matching network against the data to be matched, until a match succeeds or matching ends.
In one embodiment, performing feature matching on the data to be matched using the merged matching network and the common rule fragments includes one or more of the following operations: matching the data to be matched against the set of element nodes of the rule expression matching tree involved in each rule identifier, entering from the entry of the merged matching network; and, if an element node of the rule expression matching tree is matched, caching the element matching result.
In one embodiment, performing feature matching on the data to be matched using the merged matching network and the common rule fragments further includes: if a logical node of the rule expression matching tree is reached, querying the cache for whether the parent element nodes of the logical node have been hit, wherein: if there is no cached result, the next element node is taken from the element node set and matched against the data to be matched; if there is a cached result, the cached result is used directly for the logical operation; and, if the logical node belongs to a common rule fragment, the logical matching result is cached.
In one embodiment, performing feature matching on the data to be matched using the merged matching network and the common rule fragments further includes: if a rule identifier node of the rule expression matching tree is reached, returning the rule identifier that was hit.
In a second aspect, a rule expression matching apparatus is provided, configured to perform the method of any embodiment of the first aspect, the apparatus comprising: a syntax checker, configured to receive a rule text string, perform a syntax check on the rule text string, and output a rule expression; a feature converter, configured to losslessly convert the rule expression into a simplest rule expression based on a cyclic binary code simplification algorithm; a predicate calculator, configured to equivalently convert the simplest rule expression into a rule expression matching tree based on a predicate calculus algorithm; a network merger, configured to merge multiple rule expression matching trees into a merged matching network and identify common rule fragments; and a feature matcher, configured to perform feature matching on data to be matched using the merged matching network and the common rule fragments.
In a third aspect, a rule expression matching apparatus is provided, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, the computer-readable storage medium storing a program which, when executed by a multi-core processor, causes the multi-core processor to perform the method of the first aspect.
One advantage of the above embodiments is that matching efficiency can be significantly improved.
Other advantages of the present invention are explained in more detail in the following description and the accompanying drawings.
It should be understood that the above is only an overview of the technical solution of the present invention, given so that its technical means can be understood more clearly and implemented according to the contents of the specification. To make the above and other objects, features, and advantages of the present invention more apparent, specific embodiments of the present invention are described below by way of example.
Brief Description of the Drawings
The advantages and benefits described herein, as well as other advantages and benefits, will become apparent to those of ordinary skill in the art upon reading the following detailed description of the exemplary embodiments. The drawings serve only to illustrate the exemplary embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference numerals denote the same components. In the drawings:
FIG. 1 is a schematic structural diagram of a rule expression matching apparatus according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a rule expression matching method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the rule tree of a rule expression according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of rule tree conversion according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of rule tree conversion according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of rule tree inversion according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of rule tree merging according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of rule tree merging according to another embodiment of the present invention;
FIG. 9 is a schematic diagram of a rule tree according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of rule tree merging according to an embodiment of the present invention.
In the drawings, the same or corresponding reference numerals denote the same or corresponding parts.
Detailed Description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
In the description of the embodiments of the present application, it should be understood that terms such as “including” or “having” are intended to indicate the presence of the features, numbers, steps, acts, components, parts, or combinations thereof disclosed in this specification, and are not intended to exclude the possibility that one or more other features, numbers, steps, acts, components, parts, or combinations thereof are present.
Unless otherwise specified, “/” means “or”; for example, A/B can mean A or B. The term “and/or” herein merely describes an association between associated objects and indicates that three relationships are possible; for example, A and/or B can mean: A alone, both A and B, or B alone.
The terms “first”, “second”, and so on are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by “first”, “second”, and so on may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present application, unless otherwise specified, “multiple” means two or more.
FIG. 1 shows an exemplary rule expression matching apparatus, which includes: a syntax checker 110, configured to receive a rule text string, perform a syntax check on the rule text string, and output a rule expression, ensuring that the rule expression used for matching is well formed; a feature converter 120, configured to losslessly convert the rule expression into a simplest rule expression based on a cyclic binary code simplification algorithm; a predicate calculator 130, configured to equivalently convert the simplest rule expression into a rule expression matching tree based on a predicate calculus algorithm, in preparation for the subsequent identification of common fragments; a network merger 140, configured to merge multiple rule expression matching trees into a merged matching network and identify common rule fragments; and a feature matcher 150, configured to perform feature matching on data to be matched using the merged matching network and the common rule fragments. In this way, through a series of simplification steps, matching efficiency can be significantly improved.
FIG. 2 shows a flowchart of a rule expression matching method according to an embodiment of the present disclosure. It should be understood that the method 200 may also include additional blocks that are not shown and/or may omit blocks that are shown; the scope of the present disclosure is not limited in this respect.
Step 210: receive a rule text string, perform a syntax check on the rule text string, and output a rule expression.
In one embodiment, a completeness syntax check is performed on the rule text string using a context-free grammar and a recursive descent algorithm.
In one embodiment, the syntax check of the rule text string may specifically be carried out by the following steps: reading in the rule text string and splitting it at predetermined delimiters to obtain multiple lexemes, where lexemes can be divided into a key element type and a logical operation type; ordering the lexemes according to their order in the rule text string to generate a lexical unit sequence; and traversing the lexical unit sequence to check the syntax of the rule text string.
Specifically, each rule that may apply can be configured as a corresponding text string, that is, a rule text string. For example, suppose there are currently the following two rule text strings:
An electronic device can receive rule text strings configured by developers or others and, for each received rule text string, split it at the agreed delimiter and remove redundant spaces and line breaks to obtain the individual lexemes. For example, splitting at the comma “,” yields the lexemes; the lexical unit attribute table is then queried to obtain the attribute information of each lexeme:
Lexemes are divided into a key element type and a logical operation type. The logical operation type mainly consists of AND (&), OR (|), NOT (!), and parentheses (); the key element type depends on the meaning in the specific business scenario, for example, a foreign card used domestically or a non-financial institution.
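The splitting and classification described above can be sketched as follows. This is only a minimal illustration: the comma-delimited sample string, the `Token` class, the kind names, and the `OPERATOR_KINDS` table (standing in for the lexical unit attribute table) are assumptions made for the example, not details given in the source.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-in for the lexical unit attribute table:
# maps each logical-operation lexeme to a token kind.
OPERATOR_KINDS = {"&": "AND", "|": "OR", "!": "NOT", "(": "LPAREN", ")": "RPAREN"}


@dataclass(frozen=True)
class Token:
    kind: str      # "AND", "OR", "NOT", "LPAREN", "RPAREN", or "COND" (key element)
    text: str      # the lexeme as written in the rule text string
    position: int  # index of the lexeme within the rule text string


def tokenize(rule_text: str, delimiter: str = ",") -> List[Token]:
    """Split a rule text string at the delimiter and classify each lexeme."""
    tokens: List[Token] = []
    for raw in rule_text.split(delimiter):
        lexeme = raw.strip()  # drop redundant spaces and line breaks
        if not lexeme:
            continue          # skip empty pieces such as a trailing delimiter
        kind = OPERATOR_KINDS.get(lexeme, "COND")  # anything else is a key element
        tokens.append(Token(kind, lexeme, len(tokens)))
    return tokens


if __name__ == "__main__":
    for token in tokenize("!, A ,&,\n(,B,|,C,)"):
        print(token)
```

Because each token records its position, the resulting list is already the ordered lexical unit sequence that the later parsing step reads from left to right.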
Each lexeme can be ordered according to its position in the rule text string; for ease of description, the ordered lexemes are called the lexical unit sequence. The model created from the context-free grammar can be called the production model G=(N,Σ,P,S).
For example, a schematic of the production model provided by some embodiments is shown below. Based on a production model created from a context-free grammar that conforms to Extended Backus-Naur Form (EBNF), the rule text string can be parsed and the rule tree corresponding to the rule text string can be generated.

For example, when a rule text string is parsed with the production model created from the context-free grammar, each lexeme in the lexical unit sequence of the rule text string can be read in order using a top-down recursive descent algorithm with one symbol of lookahead; the lookahead symbol guides the choice of grammar rule (parsing function) and determines the one parsing function that applies to each lexeme. For example, there are five parsing functions: the expression parsing function expr(), the OR parsing function or(), the AND-left parsing function andLeftCond(), the AND-right parsing function andRightCond(), and the NOT parsing function notCond(). Each parsing function can be handled according to the conventional recursive descent algorithm. The parsed lexical unit sequence is then traversed once, which both checks the syntax of the rule expression and generates the rule tree corresponding to the rule text string.
Specifically, the process of building the rule tree corresponding to a rule text string may include:
Each key element contained in the rule text string becomes a node of the rule tree corresponding to that rule text string, and each logical operator in the rule text string likewise becomes a node of the rule tree; nodes that are related in the rule text string are then connected, which builds the rule tree corresponding to the rule text string.
In one possible implementation, while the rule tree for a rule text string is being built, when the parsing function that applies to a lexeme in the lexical unit sequence is the OR parsing function or(), the AND-right parsing function andLeftCond(), the NOT parsing function notCond(), and so on, a new matching intermediate node or (OR), and (AND), or (OR), and so on can be created in the rule tree at the same time; when a lexeme (element) in the lexical unit sequence is a terminal symbol (TOK_COND), a corresponding leaf node can be added in the layer below the associated intermediate node. Once all lexical units have been traversed, if the syntax is satisfied, the rule tree shown in FIG. 3 has been built at the same time.
This embodiment provides a complete grammar checking method suited to the logical operations of rule expressions. It uses a recursive descent algorithm, completes the syntax check of a rule expression in a single traversal, and can check the syntax of an unbounded number of logical operation combinations.
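A single-pass recursive descent check of this kind can be sketched as below. The production model itself is not reproduced in this text, so the grammar used here is an assumption: `expr := and ('|' and)*`, `and := not ('&' not)*`, `not := '!' not | '(' expr ')' | key-element`. The method names `parse_or`, `parse_and`, and `parse_not` only loosely correspond to the parsing functions named above, and the tuple-based tree shape is likewise illustrative.

```python
OPERATORS = {"&", "|", "!", "(", ")"}


class RuleSyntaxError(ValueError):
    """Raised when the lexeme sequence does not satisfy the assumed grammar."""


class Parser:
    def __init__(self, lexemes):
        self.lexemes = list(lexemes)
        self.pos = 0

    def peek(self):
        # One symbol of lookahead; None marks the end of input.
        return self.lexemes[self.pos] if self.pos < len(self.lexemes) else None

    def advance(self):
        lexeme = self.peek()
        self.pos += 1
        return lexeme

    def parse(self):
        tree = self.parse_or()
        if self.peek() is not None:
            raise RuleSyntaxError(f"unexpected {self.peek()!r} at position {self.pos}")
        return tree

    def parse_or(self):
        children = [self.parse_and()]
        while self.peek() == "|":
            self.advance()
            children.append(self.parse_and())
        # Create an intermediate "or" node only when there is more than one operand.
        return children[0] if len(children) == 1 else ("or", children)

    def parse_and(self):
        children = [self.parse_not()]
        while self.peek() == "&":
            self.advance()
            children.append(self.parse_not())
        return children[0] if len(children) == 1 else ("and", children)

    def parse_not(self):
        lexeme = self.advance()
        if lexeme == "!":
            return ("not", [self.parse_not()])
        if lexeme == "(":
            tree = self.parse_or()
            if self.advance() != ")":
                raise RuleSyntaxError("missing ')'")
            return tree
        if lexeme is None or lexeme in OPERATORS:
            raise RuleSyntaxError(f"expected a key element, got {lexeme!r}")
        return lexeme  # a key element becomes a leaf node


if __name__ == "__main__":
    print(Parser(["!", "A", "&", "(", "B", "|", "C", ")"]).parse())
```

The parser consumes each lexeme exactly once, so checking the syntax and building the tree happen in the same traversal; nesting depth is limited only by the recursion limit of the runtime.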
Step 220: losslessly convert the rule expression into the simplest rule expression based on the cyclic binary code simplification algorithm.
In one embodiment, step 220 may further include:
Step 221: obtain all key elements in the rule expression, and generate all combinations of the key elements from the positive and negated values of each key element.
For example, based on the rule expression output in step 210:
F=(!A&C&D)|(!A&!C&D)|(A&!B&D)|(A&!B&!D)|(A&!B&C&D)
The key elements are A, B, C, and D, each taking the value 0 or 1. Assume that a key element stands for 1 and its negation (!key element) stands for 0. The key elements then have 2^N=2^4=16 possible combinations (2 to the power of N, where N is the number of key elements), written as binary strings: 0000, 0001, 0010, 0011, 0100, ..., 1111. The correspondence is as follows: !A!B!C!D=0000, !A!B!CD=0001, !A!BC!D=0010, ..., ABCD=1111.
Writing each binary string in decimal gives:
!A!B!C!D=0000=0, !A!B!CD=0001=1, !A!BC!D=0010=2, ..., ABCD=1111=15.
Step 222: from all the combinations, obtain the range of combination values that make the rule expression true, and obtain the binary code combination for that range.
For example, the combinations that make the original rule expression true are taken from all the combinations, that is:
f(A,B,C,D)=(!A&!B&C&D)|(!A&B&C&D)|(!A&!B&!C&D)|(!A&B&!C&D)|(A&!B&!C&!D)|(A&!B&!C&D)|(A&!B&C&D)|(A&!B&C&!D)
In decimal, this is:
f(ABCD)=∑(1,3,5,7,8,9,10,11)
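The enumeration in steps 221 and 222 can be reproduced with a short truth-table sketch. The function names `rule_f` and `true_codes` are illustrative assumptions; the rule itself is the expression F from the example, with A as the most significant bit.

```python
from itertools import product


def rule_f(A, B, C, D):
    """The example rule expression F, written with Python boolean operators."""
    return ((not A and C and D) or (not A and not C and D) or (A and not B and D)
            or (A and not B and not D) or (A and not B and C and D))


def true_codes(rule, n):
    """Return the binary codes (as strings) of all assignments that make `rule` true."""
    codes = []
    for bits in product((0, 1), repeat=n):  # 0000, 0001, ..., 1111 for n = 4
        if rule(*map(bool, bits)):
            codes.append("".join(str(b) for b in bits))
    return codes


codes = true_codes(rule_f, 4)
print(codes)                       # binary code combination
print([int(c, 2) for c in codes])  # the same codes in decimal
```

Running the sketch yields the codes 0001, 0011, 0101, 0111, 1000, 1001, 1010, and 1011, that is, the decimal values 1, 3, 5, 7, 8, 9, 10, and 11 listed above.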
Step 223: perform same-position cyclic binary code merging on the binary codes in the binary code combination to obtain a simplified binary code combination;
In one implementation, step 223 may specifically include:
(1) comparing the binary codes in the binary code combination pairwise and merging them to generate new binary codes;
More specifically, when two binary codes differ in exactly one bit, that bit is replaced with a preset symbol and the remaining, identical bits are kept unchanged, giving the new binary code. For example, "0000" and "0010" differ in one bit and can be merged into "00*0", where "*" is the preset symbol.
(2) comparing the new binary codes and the original binary codes that could not be merged pairwise, merging them to generate new binary codes, and removing duplicate binary codes;
Merging steps (1) and (2) are repeated until no new binary code can be generated.
Optionally, other same-position cyclic binary code merging approaches may also be used; this application places no specific limitation on this.
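The merging loop described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the names `merge_pair` and `merge_codes` are hypothetical, and two codes are only merged when their "*" positions already coincide, which is how the similar tabulation step of the Quine-McCluskey method behaves.

```python
from itertools import combinations

def merge_pair(a, b):
    """Merge two codes that differ in exactly one (non-'*') position, else return None."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) != 1:
        return None
    i = diff[0]
    if "*" in (a[i], b[i]):          # assumption: a '*' never merges with a 0/1
        return None
    return a[:i] + "*" + a[i + 1:]

def merge_codes(codes):
    """Repeat pairwise merging until no new code can be generated."""
    current = set(codes)
    result = set()
    while current:
        merged, used = set(), set()
        for a, b in combinations(sorted(current), 2):
            m = merge_pair(a, b)
            if m is not None:
                merged.add(m)        # the set removes duplicate codes
                used.update((a, b))
        result |= current - used     # codes that could not be merged are kept
        current = merged
    return result

codes = ["0001", "0011", "0101", "0111", "1000", "1001", "1010", "1011"]
print(sorted(merge_codes(codes)))    # ['*0*1', '0**1', '10**']
```

For the codes of the worked example this sketch ends with "0**1", "10**" and "*0*1". The last code only covers combinations already covered by the other two, so arriving at a two-term result requires an additional redundancy-removal step that this passage does not spell out.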
Step 224: convert each bit in the simplified binary code combination back into the corresponding key element, and output the simplest rule expression.
In one implementation, step 224 may specifically include one or more of the following operations:
(1) for each binary code in the simplified binary code combination, converting each bit into the corresponding key element according to the position of the bit;
(2) negating or not negating the key element according to the value of each bit; and
(3) if the binary code includes a bit whose value is the preset symbol, ignoring the corresponding key element.
For example, "0000" is converted into "!A!B!C!D", "0011" into "!A!BCD", "11*1" into "ABD", and so on.
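The conversion back to key elements can be illustrated with a short Python sketch. The function names `code_to_term` and `to_expression` are hypothetical; the output notation follows the examples given above.

```python
def code_to_term(code, elements):
    """Convert one binary code into a product term over the key elements."""
    parts = []
    for bit, name in zip(code, elements):
        if bit == "*":               # preset symbol: the key element is ignored
            continue
        parts.append(name if bit == "1" else "!" + name)
    return "".join(parts)

def to_expression(codes, elements):
    """Join the terms of all codes with the OR operator."""
    return "|".join("(" + code_to_term(c, elements) + ")" for c in codes)

print(code_to_term("0000", "ABCD"))            # !A!B!C!D
print(code_to_term("11*1", "ABCD"))            # ABD
print(to_expression(["0**1", "10**"], "ABCD")) # (!AD)|(A!B)
```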
By way of example, the merging process is as follows:
[Illustration of the merging process; not reproduced in this text.]
The original rule expression can thus be simplified equivalently:
Original rule expression: F = (!A&C&D)|(!A&!C&D)|(A&!B&D)|(A&!B&!D)|(A&!B&C&D)
Simplest rule expression: F = (!AD)|(A!B)
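That the two expressions are equivalent can be checked by comparing their truth tables. The following is a small illustrative sketch; the helper name `equivalent` is an assumption of this example.

```python
from itertools import product

def original(a, b, c, d):
    return ((not a and c and d) or (not a and not c and d)
            or (a and not b and d) or (a and not b and not d)
            or (a and not b and c and d))

def simplest(a, b, c, d):
    return (not a and d) or (a and not b)

def equivalent(f, g, n):
    """True if f and g agree on every assignment of n Boolean inputs."""
    return all(bool(f(*bits)) == bool(g(*bits))
               for bits in product((False, True), repeat=n))

print(equivalent(original, simplest, 4))  # True
```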
In this embodiment, the above simplification algorithm based on cyclic binary codes simplifies the rule expression configured by the user, automatically removes and simplifies redundant rule expression fragments, losslessly converts the expression into the simplest rule expression, and generates the matching tree from the simplified rule expression.
In this embodiment, the lossless conversion reduces the text rule configured by the user to the simplest rule expression, which greatly simplifies the subsequent matching process and improves matching efficiency.
Step 230: equivalently convert the simplest rule expression into a rule expression matching tree based on a predicate calculus algorithm;
In one implementation, step 230 may specifically include: first obtaining the rule tree corresponding to the simplest rule expression, and then repeatedly applying one or more of the following predicate deduction operations until the tree is stable, to obtain the rule expression matching tree:
(1) if a NOT operation in the rule tree has multiple child nodes, pushing the NOT operation down into the child nodes and interchanging the AND and OR operators;
For example, referring to FIG. 4, if a NOT operation has multiple child nodes, the NOT operation is pushed down into the child nodes, the AND operation "&" and the OR operation "|" are interchanged, and parentheses are added after the push-down.
(2) if the current operator of the simplest rule expression is the same as the operator of its parent node, moving the child nodes of the current operator up and deleting the current operator;
For example, referring to FIG. 5, when the current operator is the same as the parent node's operator, the child nodes of the current operator are moved up and the current operator is deleted.
(3) sorting the leaf nodes of the same level by a unique node attribute.
For example, for sorting within a node: leaf nodes of the same level are sorted by a unique node attribute, such as field ID in ascending or descending order, and non-leaf nodes of the same level are placed after them. The unique attribute of a non-leaf node is composed of the unique attributes of its child nodes, so non-leaf nodes of the same level are likewise sorted by unique attribute.
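The three operations above can be sketched in Python as a single recursive normalization. This is only an illustrative sketch: the tuple encoding of nodes (`("var", name)`, `("not", child)`, `("and", children)`, `("or", children)`), the function names, the treatment of a negated element as a leaf, and the removal of double negation are all assumptions of this example rather than details given in the text.

```python
def sort_key(node):
    """Unique attribute used for ordering: leaves first, then non-leaf nodes."""
    kind = node[0]
    if kind == "var":
        return (0, node[1], 0)
    if kind == "not":                        # negated element, treated as a leaf
        return (0, node[1][1], 1)
    # A non-leaf node's attribute is built from its children's attributes.
    return (1, kind, tuple(sort_key(c) for c in node[1]))

def normalize(node):
    kind = node[0]
    if kind == "var":
        return node
    if kind == "not":
        child = node[1]
        if child[0] == "var":
            return node
        if child[0] == "not":                # assumption: drop double negation
            return normalize(child[1])
        # (1) push NOT down and interchange AND / OR
        flipped = "or" if child[0] == "and" else "and"
        return normalize((flipped, tuple(("not", c) for c in child[1])))
    children = []
    for c in node[1]:
        c = normalize(c)
        if c[0] == kind:                     # (2) same operator as parent: move children up
            children.extend(c[1])
        else:
            children.append(c)
    # (3) sort children of the same level by their unique attribute
    return (kind, tuple(sorted(children, key=sort_key)))

A, B, C = ("var", "A"), ("var", "B"), ("var", "C")
print(normalize(("and", (C, ("and", (B, A))))))
# ('and', (('var', 'A'), ('var', 'B'), ('var', 'C')))
```

Because every rule is brought into the same canonical form, two fragments that are logically written the same way end up as identical trees, which is what makes the later identification of common fragments straightforward.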
Step 240: merge multiple rule expression matching trees into a merged matching network and identify common rule fragments;
In one implementation, step 240 may further include:
(1) selecting one rule expression matching tree, whose root node is the rule identifier, and transposing it vertically to serve as the initial state of the merged matching network;
For example, referring to FIG. 6, one rule network is selected and transposed vertically, so that the original leaf nodes become the entry points and the rule that is finally hit becomes the terminal node; this network is the initial state of the merged network. After the transposition, the element nodes are at the top level.
(2) traversing the other rule expression matching trees one by one, transposing each vertically, and merging them one by one into the merged matching network;
Further, merging the trees one by one into the merged matching network also includes:
for an element node in a single rule expression matching tree, adding a new element node to the merged matching network or reusing an existing one; and/or
for a logical operator node in a single rule expression matching tree, adding a new logical operator node to the merged matching network or reusing an existing one; and/or
for completely coincident logical operator nodes, extracting the common rule fragment and the rule expression matching trees to which it belongs by reverse search; for example, see FIG. 7, where "condition 5 & condition 6" is a common fragment shared by rule 1 and rule 2; and/or
for partially coincident logical operator nodes, splitting the logical operator node in the merged matching network; for example, see FIG. 8.
(3) after the traversal is complete, forming the complete merged matching network, extracting the common rule fragments and the rules to which they belong, and recording the single element or derived element set of each rule.
For example, the trees in FIG. 9 are merged to produce the merged matching network shown in FIG. 10, in which the elements inside the dashed box form a common rule fragment, and the element set involved in each rule identifier (rule 1 and rule 2) is recorded.
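One simple way to picture the merging step is to store every node once and record which rules use it. The sketch below does this for completely coincident nodes only; it does not implement the vertical transposition or the splitting of partially coincident nodes. The tuple encoding of trees, the function names, and the two example rules are hypothetical and are not taken from the figures.

```python
def merge_into_network(rules):
    """Map every distinct node to the set of rule identifiers that use it.

    Identical sub-trees hash to the same key, so they are stored once (reused)."""
    owners = {}

    def visit(node, rule_id):
        owners.setdefault(node, set()).add(rule_id)
        if node[0] in ("and", "or"):
            for child in node[1]:
                visit(child, rule_id)

    for rule_id, tree in rules.items():
        visit(tree, rule_id)
    return owners

def common_fragments(owners):
    """Logical operator nodes shared by more than one rule."""
    return {node: ids for node, ids in owners.items()
            if node[0] in ("and", "or") and len(ids) > 1}

c1, c2, c5, c6 = (("var", n) for n in ("c1", "c2", "c5", "c6"))
shared = ("and", (c5, c6))                       # "condition 5 & condition 6"
rules = {"rule1": ("or", (c1, shared)),
         "rule2": ("or", (c2, shared))}

owners = merge_into_network(rules)
print(common_fragments(owners))
# {('and', (('var', 'c5'), ('var', 'c6'))): {'rule1', 'rule2'}}  (set order may vary)
element_sets = {rid: {n[1] for n, ids in owners.items() if n[0] == "var" and rid in ids}
                for rid in rules}
```

The `owners` mapping plays the role of the reverse search: from any node it gives the rules to which the node belongs, and `element_sets` records the elements involved in each rule identifier.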
Step 250: perform feature matching on the data to be matched using the merged matching network and the common rule fragments.
In one implementation, the rule identifiers in the merged matching network may be prioritized in advance; the element set involved in each rule identifier in the merged matching network is then matched against the data to be matched in priority order, until a match succeeds or matching ends.
In one implementation, performing feature matching on the data to be matched using the merged matching network and the common rule fragments includes:
(1) Element matching: for each rule identifier, taking the set of element nodes of the corresponding rule expression matching tree, entering from the entry of the merged matching network, matching the element set against the data to be matched, and caching the element matching results.
(2) Logical matching: if a logical node is reached, querying the cache to determine whether its parent node has already been hit. Two cases arise: i) if there is no cached result, the next element node is taken from the element node set and matched against the data to be matched; ii) if there is a cached result, the cached result is used directly for the logical operation. In addition, if the logical node belongs to a common rule fragment, the logical matching result is cached; otherwise it is not cached, in order to save space.
(3) If the final node is reached, the identifier of the rule that was hit is returned; if the rule fails partway through, matching returns and continues with the next rule in priority order.
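The matching procedure can be sketched, in a simplified top-down form, as follows. This is a sketch under several assumptions: rules are given as trees in priority order, the data to be matched is a hypothetical mapping from element name to Boolean, element results are always cached, and results of logical nodes are cached only when the node belongs to a common rule fragment.

```python
def match(rules_by_priority, shared, data):
    """Return the identifier of the first rule (in priority order) that is hit, else None."""
    cache = {}

    def evaluate(node):
        if node in cache:                        # cached result: reuse it directly
            return cache[node]
        kind = node[0]
        if kind == "var":                        # element matching
            result = bool(data.get(node[1], False))
            cache[node] = result
        elif kind == "not":
            result = not evaluate(node[1])
        else:                                    # logical matching
            results = (evaluate(c) for c in node[1])
            result = all(results) if kind == "and" else any(results)
            if node in shared:                   # cache only common rule fragments
                cache[node] = result
        return result

    for rule_id, tree in rules_by_priority:
        if evaluate(tree):
            return rule_id
    return None

c1, c2, c5, c6 = (("var", n) for n in ("c1", "c2", "c5", "c6"))
fragment = ("and", (c5, c6))
rules = [("rule1", ("or", (c1, fragment))), ("rule2", ("or", (c2, fragment)))]

print(match(rules, {fragment}, {"c2": True}))              # rule2
print(match(rules, {fragment}, {"c5": True, "c6": True}))  # rule1
```

In the first call the shared fragment is evaluated once while checking rule 1; had rule 2 needed it, the cached value would have been reused instead of re-evaluating conditions 5 and 6.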
It should be noted that, for steps not described in detail in this embodiment, reference may be made to the description of the related steps in the embodiment shown in FIG. 1; they are not repeated here.
In the description of this specification, reference to the terms "some possible embodiments", "some embodiments", "example", "specific example", or "some examples" means that the specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. In this specification, such expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification and their features.
In the method flowcharts of the embodiments of this application, certain operations are described as distinct steps performed in a certain order. Such flowcharts are illustrative rather than restrictive. Some steps described herein may be grouped together and performed in a single operation, some steps may be divided into multiple sub-steps, and some steps may be performed in an order different from that shown herein. The steps shown in the flowcharts may be implemented in any manner by any circuit structure and/or tangible mechanism (for example, by software running on a computer device, by hardware such as logic functions implemented by a processor or chip, and/or by any combination thereof).
It should be noted that the apparatus in the embodiments of this application can implement each process of the foregoing method embodiments and achieve the same effects and functions, which are not repeated here.
According to some embodiments of this application, a non-volatile computer storage medium for the rule expression matching method is provided, on which computer-executable instructions are stored, the computer-executable instructions being configured to perform, when run by a processor, the method described in the above embodiments.
The embodiments in this application are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the apparatus, device, and computer-readable storage medium embodiments are substantially similar to the method embodiments, their descriptions are simplified, and for related details reference may be made to the corresponding parts of the method embodiments.
The apparatus, device, and computer-readable storage medium provided in the embodiments of this application correspond one-to-one to the method, and therefore have beneficial technical effects similar to those of the corresponding method. Since the beneficial technical effects of the method have been described in detail above, those of the apparatus, device, and computer-readable storage medium are not repeated here.
Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, an apparatus (device or system), or a computer-readable storage medium. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer-readable storage medium implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, apparatus (device or system), and computer-readable storage medium according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
Memory may include non-permanent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
In addition, although the operations of the method of the present invention are described in a particular order in the accompanying drawings, this does not require or imply that the operations must be performed in that order, or that all of the operations shown must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step, and/or one step may be split into multiple steps.
Although the spirit and principles of the present invention have been described with reference to several specific embodiments, it should be understood that the present invention is not limited to the specific embodiments disclosed, and the division into aspects does not mean that features in those aspects cannot be combined to advantage; the division is made only for convenience of presentation. The present invention is intended to cover the various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (18)

  1. A rule expression matching method, characterized by comprising:
    receiving a rule text string, performing a syntax check on the rule text string, and outputting a rule expression;
    losslessly converting the rule expression into a simplest rule expression based on a simplification algorithm using cyclic binary codes;
    equivalently converting the simplest rule expression into a rule expression matching tree based on a predicate calculus algorithm;
    merging multiple rule expression matching trees into a merged matching network, and identifying common rule fragments;
    performing feature matching on data to be matched using the merged matching network and the common rule fragments.
  2. The method according to claim 1, characterized in that losslessly converting the rule expression into the simplest rule expression based on the simplification algorithm using cyclic binary codes further comprises:
    obtaining all key elements in the rule expression, and generating all combinations of the key elements based on the positive and negated values of each key element;
    obtaining, from all the combinations, the range of combinations for which the rule expression is true, and obtaining the binary code combination of that range;
    performing same-position cyclic binary code merging on the binary codes in the binary code combination to obtain a simplified binary code combination;
    converting each bit in the simplified binary code combination back into the corresponding key element, and outputting the simplest rule expression.
  3. The method according to claim 2, characterized in that performing same-position cyclic binary code merging on the binary codes comprises:
    comparing the binary codes in the binary code combination pairwise, and merging them to generate new binary codes;
    comparing the new binary codes and the original binary codes that could not be merged pairwise, merging them to generate new binary codes, and removing duplicate binary codes;
    repeating the above merging steps until no new binary code can be generated.
  4. The method according to claim 2, characterized in that the same-position cyclic binary code merging further comprises:
    when two binary codes differ in exactly one bit, setting that bit to a preset symbol and keeping the remaining, identical bits unchanged, to form a new binary code.
  5. The method according to claim 3, characterized in that converting each bit in the simplified binary code combination back into the corresponding key element comprises:
    for each binary code in the simplified binary code combination, converting each bit into the corresponding key element according to the position of the bit;
    negating or not negating the key element according to the value of each bit; and
    if the binary code includes a bit whose value is the preset symbol, ignoring the corresponding key element.
  6. The method according to claim 1, characterized in that performing the syntax check on the rule text string further comprises: performing a completeness syntax check on the rule text string using a context-free grammar and a recursive descent algorithm.
  7. The method according to claim 1, characterized in that performing the syntax check on the rule text string further comprises:
    reading in the rule text string, and splitting the rule text string at predetermined delimiters to obtain multiple lexemes;
    ordering the lexemes according to their order in the rule text string to generate a lexical unit sequence;
    traversing the lexical unit sequence to check the syntax of the rule text string.
  8. The method according to claim 1, characterized in that the lexemes are divided into a key element type and a logical operation type.
  9. The method according to claim 1, characterized in that equivalently converting the simplest rule expression into the rule expression matching tree further comprises:
    repeatedly applying one or more of the following predicate deduction operations until the tree is stable, to obtain the rule expression matching tree:
    obtaining the rule tree corresponding to the simplest rule expression;
    if a NOT operation in the rule tree has multiple child nodes, pushing the NOT operation down into the child nodes and interchanging the AND and OR operators;
    if the current operator of the simplest rule expression is the same as the operator of its parent node, moving the child nodes of the current operator up and deleting the current operator;
    sorting the leaf nodes of the same level by a unique node attribute.
  10. The method according to claim 1, characterized in that merging the multiple rule expression matching trees into the merged matching network and identifying the common rule fragments comprises:
    selecting one rule expression matching tree, whose root node is the rule identifier, and transposing it vertically to serve as the initial state of the merged matching network;
    traversing the other rule expression matching trees one by one, transposing each vertically, and merging them one by one into the merged matching network;
    after the traversal is complete, forming the complete merged matching network and extracting the common rule fragments.
  11. The method according to claim 1, characterized in that merging one by one into the merged matching network further comprises:
    for an element node in a single rule expression matching tree, adding a new element node to the merged matching network or reusing an existing one; and/or
    for a logical operator node in a single rule expression matching tree, adding a new logical operator node to the merged matching network or reusing an existing one; and/or
    for completely coincident logical operator nodes, extracting the common rule fragment and the rule expression matching trees to which it belongs by reverse search; and/or
    for partially coincident logical operator nodes, splitting the logical operator node in the merged matching network.
  12. The method according to claim 1, characterized in that performing feature matching on the data to be matched using the merged matching network and the common rule fragments comprises:
    prioritizing the rule identifiers in the merged matching network in advance;
    matching the element set involved in each rule identifier in the merged matching network against the data to be matched in priority order, until a match succeeds or matching ends.
  13. The method according to claim 1, characterized in that performing feature matching on the data to be matched using the merged matching network and the common rule fragments comprises one or more of the following operations:
    matching the data to be matched against the set of element nodes of the rule expression matching tree involved in each rule identifier, entering from the entry of the merged matching network, and, if an element node of the rule expression matching tree is matched, caching the element matching result.
  14. The method according to claim 1, characterized in that performing feature matching on the data to be matched using the merged matching network and the common rule fragments further comprises:
    if a logical node of the rule expression matching tree is reached, querying the cache to determine whether the parent element node of the logical node has already been hit, wherein:
    if there is no cached result, the next element node is taken from the element node set and matched against the data to be matched;
    if there is a cached result, the cached result is used directly for the logical operation; and, if the logical node belongs to a common rule fragment, the logical matching result is cached.
  15. The method according to claim 1, characterized in that performing feature matching on the data to be matched using the merged matching network and the common rule fragments further comprises:
    if a rule identifier node of the rule expression matching tree is reached, returning the identifier of the rule that was hit.
  16. A rule expression matching apparatus, characterized in that it is configured to perform the method according to any one of claims 1-15, the apparatus comprising:
    a syntax checker, configured to receive a rule text string, perform a syntax check on the rule text string, and output a rule expression;
    a feature converter, configured to losslessly convert the rule expression into a simplest rule expression based on a simplification algorithm using cyclic binary codes;
    a predicate calculator, configured to equivalently convert the simplest rule expression into a rule expression matching tree based on a predicate calculus algorithm;
    a network merger, configured to merge multiple rule expression matching trees into a merged matching network and identify common rule fragments;
    a feature matcher, configured to perform feature matching on data to be matched using the merged matching network and the common rule fragments.
  17. 一种规则表达式匹配装置,其特征在于,包括:A regular expression matching device, characterized by comprising:
    至少一个处理器;以及,与至少一个处理器通信连接的存储器;其中,存储器存储有可被至少一个处理器执行的指令,指令被至少一个处理器执行,以使至少一个处理器能够执行:如权利要求1-15中任一项所述的方法。At least one processor; and, a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor so as to enable the at least one processor to execute: a method as described in any one of claims 1-15.
  18. A computer-readable storage medium storing a program which, when executed by a multi-core processor, causes the multi-core processor to perform the method according to any one of claims 1 to 15.
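
The merging and matching stages recited in claims 15 and 16 can be illustrated with a minimal Python sketch. It is not the claimed implementation: it assumes every rule is a plain conjunction of atomic predicates, it omits the syntax check, the cyclic-binary-code simplification, and the predicate calculus conversion, and it uses a prefix tree as a stand-in for the merged matching network. All names (OPS, Node, merge_rules, match) and the sample rules are hypothetical.

```python
from dataclasses import dataclass, field
import operator

# Hypothetical predicate form: (field name, operator symbol, constant).
OPS = {"==": operator.eq, ">": operator.gt, "<": operator.lt}


@dataclass
class Node:
    children: dict = field(default_factory=dict)  # predicate -> child Node
    rule_ids: list = field(default_factory=list)  # rules that end at this node


def merge_rules(rules):
    """Merge per-rule predicate chains into one network (a prefix tree)."""
    root = Node()
    for rule_id, predicates in rules.items():
        node = root
        # A fixed ordering makes predicates shared by several rules line up
        # as shared prefixes; these play the role of common rule fragments.
        for pred in sorted(predicates, key=repr):
            node = node.children.setdefault(pred, Node())
        node.rule_ids.append(rule_id)  # acts as the rule identifier node
    return root


def match(root, record):
    """Return the identifiers of all rules hit by the record."""
    cache = {}  # each distinct predicate is evaluated at most once
    hits = []
    stack = [root]
    while stack:
        node = stack.pop()
        hits.extend(node.rule_ids)  # reaching this node means the rule is hit
        for pred, child in node.children.items():
            if pred not in cache:
                name, op, value = pred
                cache[pred] = name in record and OPS[op](record[name], value)
            if cache[pred]:
                stack.append(child)
    return sorted(hits)


# Hypothetical sample rules; R1 and R2 share the fragment "amount > 1000".
rules = {
    "R1": [("amount", ">", 1000), ("country", "==", "CN")],
    "R2": [("amount", ">", 1000), ("channel", "==", "web")],
    "R3": [("country", "==", "US")],
}
network = merge_rules(rules)
print(match(network, {"amount": 5000, "country": "CN", "channel": "app"}))
```

In this sketch the shared predicate is stored once in the network and evaluated once per record, which is the kind of saving that identifying common rule fragments is meant to provide.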
PCT/CN2023/134854 2022-11-29 2023-11-28 Rule expression matching method and apparatus, and computer-readable storage medium WO2024114655A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211515709.6A CN116089663A (en) 2022-11-29 2022-11-29 Rule expression matching method and device and computer readable storage medium
CN202211515709.6 2022-11-29

Publications (1)

Publication Number Publication Date
WO2024114655A1 (en) 2024-06-06

Family

ID=86198195

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/134854 WO2024114655A1 (en) 2022-11-29 2023-11-28 Rule expression matching method and apparatus, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN116089663A (en)
WO (1) WO2024114655A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089663A (en) * 2022-11-29 2023-05-09 中国银联股份有限公司 Rule expression matching method and device and computer readable storage medium
CN117114142B (en) * 2023-10-23 2024-05-03 深圳市华傲数据技术有限公司 AI-based data rule expression generation method, apparatus, device and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101551757A (en) * 2009-01-09 2009-10-07 南京大学 Matching method of heuristic events based on predicate covering
US20170329788A1 (en) * 2016-05-10 2017-11-16 International Business Machines Corporation Rule generation in a data governance framework
CN112463819A (en) * 2020-11-26 2021-03-09 北京宏景世纪软件股份有限公司 Computing method, device and equipment based on Chinese expression and storage medium
CN114564624A (en) * 2022-02-11 2022-05-31 中国银联股份有限公司 Feature matching rule construction method, feature matching device, feature matching equipment and feature matching medium
CN116089663A (en) * 2022-11-29 2023-05-09 中国银联股份有限公司 Rule expression matching method and device and computer readable storage medium

Also Published As

Publication number Publication date
CN116089663A (en) 2023-05-09

Similar Documents

Publication Publication Date Title
WO2024114655A1 (en) Rule expression matching method and apparatus, and computer-readable storage medium
RU2605077C2 (en) Method and system for storing and searching information extracted from text documents
JP6160259B2 (en) Character string search method, character string search device, and character string search program
US20110153641A1 (en) System and method for regular expression matching with multi-strings and intervals
US9311058B2 (en) Jabba language
US9229691B2 (en) Method and apparatus for programming assistance
CN108431766B (en) Method and system for accessing a database
CA2809021C (en) Systems and methods for lexicon generation
US20170277811A1 (en) Efficient conditional state mapping in a pattern matching automaton
WO2016046223A1 (en) Efficient pattern matching
Bille et al. Faster regular expression matching
CN116483850A (en) Data processing method, device, equipment and medium
CN112148359B (en) Distributed code clone detection and search method, system and medium based on subblock filtering
CN111930701A (en) Log structured processing method and device
CN112612810A (en) Slow SQL statement identification method and system
JP2003242179A (en) Character string collating method, document processing device using the method and program
CN108304467B (en) Method for matching between texts
CN114880523A (en) Character string processing method and device, electronic equipment and storage medium
CN112988778B (en) Method and device for processing database query script
KR102146625B1 (en) Apparatus and method for computing incrementally infix probabilities based on automata
US10936241B2 (en) Method, apparatus, and computer program product for managing datasets
CN113626465B (en) Database and method for realizing session-level variables in postgresql database
JPWO2020049622A1 (en) Information processing equipment, analysis system, analysis method and analysis program
CN114443685A (en) SQL injection detection method and device
CN117313149A (en) SPL-based security data association query method, device, equipment and medium