WO2024114655A1 - Rule expression matching method and apparatus, and computer-readable storage medium

Info

Publication number: WO2024114655A1
Application number: PCT/CN2023/134854
Authority: WO
Inventors: 李�瑞
Original assignee: 中国银联股份有限公司
Priority date: 2022-11-29
Filing date: 2023-11-28
Publication date: 2024-06-06
Also published as: CN116089663A

Abstract

Provided in the present invention are a rule expression matching method and apparatus, and a computer-readable storage medium. The method comprises: receiving a rule text string, performing syntax validation on the rule text string, and outputting a rule expression (210); on the basis of a reduction algorithm for cyclic binary codes, converting the rule expression into the simplest rule expression in a lossless manner (220); on the basis of a predicate calculus algorithm, equivalently converting the simplest rule expression into a rule expression matching tree (230); merging a plurality of rule expression matching trees into a merged matching network, and identifying a common rule fragment (240); and using the merged matching network and the common rule fragment to perform feature matching on data to be subjected to matching (250). By using the method, the matching efficiency can be increased.

Description

A rule expression matching method, device and computer-readable storage medium

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to the Chinese patent application filed with the China Patent Office on November 29, 2022, with application number 202211515709.6 and application name “A regular expression matching method, device and computer-readable storage medium”, the entire contents of which are incorporated by reference in this application.

Technical Field

The present invention belongs to the field of feature matching, and in particular relates to a rule expression matching method, device and computer-readable storage medium.

Background

This section is intended to provide a background or context to embodiments of the invention that are recited in the claims. No description herein is admitted to be prior art by inclusion in this section.

Some matching devices in the prior art use a row-list method, which cannot express more complex rule expression patterns and offers little scalability or flexibility. Other devices use rule expressions, but to determine whether a rule expression satisfies the established syntax they typically rely on hard-coded character-by-character text parsing or on regular-expression matching (a regular-expression matching algorithm is based on a finite state machine and cannot handle an unbounded number of elements to be evaluated). These verification methods are applicable only to simple scenarios and cannot perform a complete verification over all combinations of rule expression patterns. In addition, for rule expressions configured for business purposes, there is no effective equivalent predicate operation that simplifies the user-configured rule expressions into their simplest form, and common fragments shared between rule expressions are not recognized; both shortcomings reduce subsequent matching efficiency.

Therefore, how to improve matching efficiency is an urgent problem to be solved.

Summary of the invention

In view of the problems in the prior art described above, a rule expression matching method, device and computer-readable storage medium are proposed, with which those problems can be solved.

The present invention provides the following solutions.

In a first aspect, a rule expression matching method is provided, comprising: receiving a rule text string, performing a syntax check on the rule text string, and outputting a rule expression; based on a simplification algorithm for cyclic binary codes, losslessly converting the rule expression into a simplest rule expression; based on a predicate calculus algorithm, equivalently converting the simplest rule expression into a rule expression matching tree; merging multiple rule expression matching trees into a merged matching network, and identifying common rule fragments; and performing feature matching on data to be matched using the merged matching network and the common rule fragments.

In one embodiment, losslessly converting the rule expression into a simplest rule expression based on a simplification algorithm for cyclic binary codes further includes: obtaining all key elements in the rule expression, and generating all combinations of the key elements based on the positive and negated values of each key element; obtaining, from all combinations, the combination value range that makes the rule expression true, and obtaining the binary code combination of that value range; performing same-bit cyclic binary code merging on the binary codes in the binary code combination to obtain a simplified binary code combination; and converting each binary bit in the simplified binary code combination back into its key element, and outputting the simplest rule expression.

In one embodiment, performing same-bit cyclic binary code merging on the plurality of binary codes comprises: comparing the binary codes in the binary code combination in pairs and merging them to generate new binary codes; comparing the new binary codes and the original binary codes that could not be merged in pairs, merging them to generate new binary codes, and removing duplicate binary codes; and repeating the above merging steps until no new binary code can be generated by merging.

In one embodiment, the merging of the same-bit cyclic binary codes further includes: when there is only one different binary bit in the two binary codes, setting the different binary bit as a set symbol, and keeping the other identical binary bits unchanged as a new binary code.

In one embodiment, each binary bit in the simplified binary code combination is converted back to the key element, including: for each binary code in the simplified binary code combination, converting it into a corresponding key element according to the position of the binary bit; performing a negation operation or no negation operation on the key element according to the value of each binary bit; and if the binary code includes a binary bit whose value is the set symbol, ignoring the corresponding key element.

In one embodiment, performing the syntax check on the rule text string further includes: performing a complete syntax check on the rule text string using a context-free grammar and a recursive descent algorithm.

In one embodiment, performing the syntax check on the rule text string further includes: reading the rule text string, and splitting the rule text string according to predetermined delimiters to obtain multiple morphemes; sorting the morphemes according to their order in the rule text string to generate a lexical unit sequence; and traversing the lexical unit sequence to check the grammar of the rule text string.

In one implementation, the morphemes are divided into a key element type and a logical operation type.

In one embodiment, equivalently converting the simplest rule expression into a rule expression matching tree further includes: obtaining a rule tree corresponding to the simplest rule expression, and repeatedly executing one or more of the following predicate deduction operations until the tree is stable, to obtain the rule expression matching tree: if a NOT operation of the rule tree has multiple child nodes, pushing the NOT operation down to the child nodes and interchanging the AND and OR operators; if the current operator is the same as the operator of its parent node, moving the child nodes of the current operator up and deleting the current operator; and sorting leaf nodes at the same level according to the unique attributes of the nodes.

In one embodiment, merging a plurality of the rule expression matching trees into a merged matching network and identifying common rule fragments includes: selecting one rule expression matching tree, transposing it top to bottom, and taking the rule identifier of that rule expression matching tree as the root node to form the initial state of the merged matching network; traversing the other rule expression matching trees one by one, transposing each top to bottom, and integrating them into the merged matching network one by one; and, after the traversal is completed, forming a complete merged matching network and extracting the common rule fragments.

In one embodiment, the integration into the merged matching network one by one also includes: for the element nodes in a single rule expression matching tree, adding or reusing the element nodes in the merged matching network; and/or, for the logic symbol nodes in a single rule expression matching tree, adding or reusing the logic symbol nodes in the merged matching network; and/or, for completely overlapping logic symbol nodes, the common rule fragment and its corresponding rule expression matching tree can be extracted through reverse search; and/or, for partially overlapping logic symbol nodes, the logic symbol nodes in the merged matching network are split.

In one embodiment, the merged matching network and the common rule fragments are used to perform feature matching on the data to be matched, including: pre-prioritizing the various rule identifiers in the merged matching network; matching the element set involved in each rule identifier in the merged matching network with the data to be matched in sequence according to the priority until the match is successful or the match is completed.

In one embodiment, the merged matching network and the common rule fragments are used to perform feature matching on the data to be matched, including one or more of the following operations: matching the data to be matched with a set of element nodes of the rule expression matching tree involved in each rule identifier, entering from the entrance of the merged matching network, and caching the element matching results if an element node of the rule expression matching tree is matched.

In one embodiment, the merged matching network and the common rule fragment are used to perform feature matching on the data to be matched, and the method also includes: if a logical node of the rule expression matching tree is matched, querying in the cache whether the parent element node of the logical node has been hit, wherein: if there is no cached result, taking the next element node from the element node set to match the data to be matched; if there is a cached result, directly taking the cached result for logical operation; and, if the logical node belongs to a common rule fragment, caching the logical matching result.

In one embodiment, using the merged matching network and the common rule fragment to perform feature matching on the to-be-matched data further includes: if a rule identification node of the rule expression matching tree is matched, returning a hit rule identification.

In a second aspect, a rule expression matching device is provided, which is configured to execute any method as described in the first aspect, and the device includes: a grammar checker, which is used to receive a rule text string, perform a syntax check on the rule text string, and output a rule expression; a feature converter, which is used to losslessly convert the rule expression into a simplest rule expression based on a simplification algorithm for cyclic binary codes; a predicate calculator, which is used to convert the simplest rule expression into a rule expression matching tree based on a predicate calculus algorithm; a network merger, which is used to merge multiple rule expression matching trees into a merged matching network and identify common rule fragments; and a feature matcher, which is used to perform feature matching on data to be matched using the merged matching network and the common rule fragments.

According to a third aspect, a rule expression matching device is provided, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the method according to the first aspect.

In a fourth aspect, a computer-readable storage medium is provided, wherein the computer-readable storage medium stores a program, and when the program is executed by a multi-core processor, the multi-core processor executes the method of the first aspect.

One of the advantages of the above implementation is that it can significantly improve the matching efficiency.

Other advantages of the present invention will be explained in more detail with reference to the following description and accompanying drawings.

It should be understood that the above description is only an overview of the technical solution of the present invention, so that the technical means of the present invention can be more clearly understood and implemented according to the contents of the specification. In order to make the above and other purposes, features and advantages of the present invention more obvious and easy to understand, the specific implementation methods of the present invention are described below by way of example.

BRIEF DESCRIPTION OF THE DRAWINGS

The advantages and benefits described herein and other advantages and benefits will be apparent to those of ordinary skill in the art upon reading the detailed description of the exemplary embodiments below. The accompanying drawings are only for the purpose of illustrating exemplary embodiments and are not to be considered as limiting the present invention. Also, the same reference numerals are used throughout the accompanying drawings to represent the same components. In the accompanying drawings:

FIG1 is a schematic diagram of the structure of a rule expression matching device according to an embodiment of the present invention;

FIG2 is a schematic flow chart of a rule expression matching method according to an embodiment of the present invention;

FIG3 is a schematic diagram of a rule tree of a rule expression according to an embodiment of the present invention;

FIG4 is a schematic diagram of rule tree conversion according to an embodiment of the present invention;

FIG5 is a schematic diagram of rule tree conversion according to an embodiment of the present invention;

FIG6 is a schematic diagram of a rule tree inversion according to an embodiment of the present invention;

FIG7 is a schematic diagram of rule tree merging according to an embodiment of the present invention;

FIG8 is a schematic diagram of rule tree merging according to another embodiment of the present invention;

FIG9 is a schematic diagram of a rule tree according to an embodiment of the present invention;

FIG10 is a schematic diagram of rule tree merging according to an embodiment of the present invention;

In the drawings, the same or corresponding reference numerals represent the same or corresponding parts.

Detailed Description

The exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the exemplary embodiments of the present disclosure are shown in the accompanying drawings, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments described herein. On the contrary, these embodiments are provided in order to enable a more thorough understanding of the present disclosure and to fully convey the scope of the present disclosure to those skilled in the art.

In the description of the embodiments of the present application, it should be understood that terms such as "including" or "having" are intended to indicate the presence of features, numbers, steps, behaviors, components, parts, or a combination thereof disclosed in this specification, and are not intended to exclude the possibility of the presence of one or more other features, numbers, steps, behaviors, components, parts, or a combination thereof.

Unless otherwise specified, “/” means or. For example, A/B can mean A or B. The “and/or” in this article is merely a way to describe the association relationship of associated objects, indicating that three relationships can exist. For example, A and/or B can mean: A exists alone, A and B exist at the same time, and B exists alone.

The terms "first", "second", etc. are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, a feature defined as "first", "second", etc. may explicitly or implicitly include one or more of the features. In the description of the embodiments of the present application, unless otherwise specified, "plurality" means two or more.

FIG1 shows an exemplary rule expression matching device, which includes: a grammar checker 110, which is used to receive a rule text string, perform a syntax check on the rule text string, and output a rule expression, thereby ensuring that the rule expression used for matching is well formed; a feature converter 120, which is used to losslessly convert the rule expression into a simplest rule expression based on a simplification algorithm for cyclic binary codes; a predicate calculator 130, which is used to convert the simplest rule expression into a rule expression matching tree based on a predicate calculus algorithm, in preparation for the subsequent identification of common fragments; a network merger 140, which is used to merge multiple rule expression matching trees into a merged matching network and identify common rule fragments; and a feature matcher 150, which is used to perform feature matching on the data to be matched using the merged matching network and the common rule fragments. Through this series of simplification steps, the matching efficiency can be improved.

FIG2 shows a flow chart of a method 200 for performing rule expression matching according to an embodiment of the present disclosure. It should be understood that the method 200 may also include additional blocks not shown and/or may omit the blocks shown, and the scope of the present disclosure is not limited in this respect.

Step 210, receiving a rule text string, performing syntax check on the rule text string, and outputting a rule expression;

In one embodiment, a complete syntax check is performed on the rule text string using a context-free grammar and a recursive descent algorithm.

In one embodiment, the syntax check of a rule text string may specifically be performed as follows: read in the rule text string, and split it according to predetermined delimiters to obtain a plurality of morphemes, wherein morphemes may be divided into key element types and logical operation types; sort the morphemes according to their order in the rule text string to generate a lexical unit sequence; and traverse the lexical unit sequence to verify the grammar of the rule text string.

Specifically, each potentially applicable rule can be configured as a corresponding text string, that is, a rule text string. For example, if there are currently the following two rule text strings:

The electronic device can receive the rule text strings configured by a developer or other user. For each received rule text string, it splits the string according to the agreed delimiter and removes extra spaces and line breaks to obtain the individual morphemes. For example, splitting on the comma "," yields each morpheme; the lexical unit attribute table is then queried to obtain the attribute information of each morpheme:

Morphemes are divided into key element types and logical operation types. The logical operation type mainly consists of AND (&), OR (|), NOT (!), and the brackets ( ); key element types depend on the meaning of the specific business scenario, such as foreign card internal use and non-financial institutions.

The morphemes are sorted according to their order in the rule text string. For convenience of description, the sorted morphemes are called a lexical unit sequence. The model created based on the context-free grammar can be called a production model G = (N, Σ, P, S), where N is the set of non-terminal symbols, Σ is the set of terminal symbols, P is the set of productions, and S is the start symbol.
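The splitting and classification step above can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: it assumes that the operators &, |, ! and the brackets themselves act as delimiters, that key elements are whitespace-free names, and that the type label TOK_OP is a hypothetical counterpart to the TOK_COND terminal mentioned later in this description.

```python
import re

# Assumed logical-operation morphemes; everything else is a key element.
OPERATORS = {"&", "|", "!", "(", ")"}

def tokenize(rule_text):
    """Split a rule text string into an ordered lexical unit sequence.

    Each unit is a (type, lexeme) pair. Whitespace and line breaks are
    dropped; source order is preserved by re.findall.
    """
    lexemes = re.findall(r"[&|!()]|[^&|!()\s]+", rule_text)
    return [("TOK_OP" if lx in OPERATORS else "TOK_COND", lx) for lx in lexemes]

if __name__ == "__main__":
    for unit in tokenize("!A & (B | C)"):
        print(unit)
```

A real attribute table would likely carry more per-morpheme information (for example a field ID), which this sketch leaves out.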

By way of example, in some embodiments the production model is created from a context-free grammar conforming to extended Backus-Naur Form (EBNF); it is used to perform grammatical analysis on a rule text string and to generate the rule tree corresponding to that rule text string.

By way of example, when grammatical analysis is performed on a rule text string based on a production model created from a context-free grammar, the morphemes in the lexical unit sequence corresponding to the rule text string can be read in order and parsed with a top-down recursive descent algorithm. One symbol is looked ahead at each step, and the look-ahead symbol guides the selection of the grammar rule (analysis function) applicable to each morpheme. For example, there are five analysis functions: the expression analysis function expr(), the OR analysis function or(), the AND-left analysis function andLeftCond(), the AND-right analysis function andRightCond(), and the NOT analysis function notCond(). Each analysis function can be processed according to a conventional recursive descent algorithm, so that a single traversal of the lexical unit sequence both verifies the grammar of the rule expression and generates the rule tree corresponding to the rule text string.

Specifically, the process of establishing a rule tree corresponding to the rule text string may include:

Each key element contained in the rule text string is respectively included as a node in the rule tree corresponding to the rule text string, and the logical operators in the rule text string are also respectively included as a node in the rule tree. The nodes that have an associated relationship in the rule text string can be connected to establish the rule tree corresponding to the rule text string.

In a possible implementation, when the rule tree corresponding to a rule text string is established, if the analysis function applicable to a morpheme in the lexical unit sequence is the OR analysis function or(), the AND-left analysis function andLeftCond(), the NOT analysis function notCond(), or the like, a corresponding new intermediate node (OR, AND, NOT, etc.) is created in the rule tree at the same time. If the morpheme (element) is a terminal symbol (TOK_COND), a corresponding leaf node is added below the associated intermediate node. After all lexical units have been traversed, if the grammar is satisfied, the rule tree shown in FIG3 has been established.

In this embodiment, a complete grammar verification method suitable for the logical operations of rule expressions is provided. It uses a recursive descent algorithm to complete the grammar verification of a rule expression in a single pass, and can check arbitrarily nested combinations of logical operations.
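As an illustration of the single-pass check described above, the following sketch parses a rule text string with one-symbol look-ahead and builds the rule tree while validating. The grammar is an assumption (the production model itself is not reproduced in this text): expr := and ('|' and)*, and := not ('&' not)*, not := '!' not | '(' expr ')' | element. The method names are simplified stand-ins for expr(), or(), andLeftCond(), andRightCond() and notCond(), and the tuple tree format (op, [children]) is likewise hypothetical.

```python
import re

OPERATORS = {"&", "|", "!", "(", ")"}

def tokenize(rule_text):
    return re.findall(r"[&|!()]|[^&|!()\s]+", rule_text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # One-symbol look-ahead; None marks end of input.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"unexpected token {tok!r} at position {self.pos}")
        self.pos += 1
        return tok

    def or_expr(self):
        kids = [self.and_expr()]
        while self.peek() == "|":
            self.take("|")
            kids.append(self.and_expr())
        return kids[0] if len(kids) == 1 else ("or", kids)

    def and_expr(self):
        kids = [self.not_expr()]
        while self.peek() == "&":
            self.take("&")
            kids.append(self.not_expr())
        return kids[0] if len(kids) == 1 else ("and", kids)

    def not_expr(self):
        tok = self.peek()
        if tok == "!":
            self.take("!")
            return ("not", [self.not_expr()])
        if tok == "(":
            self.take("(")
            node = self.or_expr()
            self.take(")")
            return node
        if tok is None or tok in OPERATORS:
            raise ValueError(f"expected a key element, got {tok!r}")
        return self.take()  # terminal symbol: becomes a leaf node

def parse(rule_text):
    """Validate the rule text and return its rule tree, or raise ValueError."""
    parser = Parser(tokenize(rule_text))
    tree = parser.or_expr()
    if parser.peek() is not None:
        raise ValueError(f"trailing token {parser.peek()!r}")
    return tree

if __name__ == "__main__":
    print(parse("(!A&D)|(A&!B)"))
```

Because brackets recurse back into or_expr, nesting depth is limited only by the call stack, which is the property a finite state machine lacks.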

Step 220, based on the simplification algorithm for cyclic binary codes, losslessly converting the rule expression into a simplest rule expression;

In one implementation, the above step 220 may further include:

Step 221, obtaining all key elements in the rule expression, and generating all combinations of the key elements based on the positive and negated values of each key element;

For example, based on the rule expression output in the aforementioned step 210:
F = (!A&C&D)|(!A&!C&D)|(A&!B&D)|(A&!B&!D)|(A&!B&C&D)

Here the key elements are A, B, C and D. Each key element takes the value 0 or 1: a key element in positive form is represented by 1, and a negated key element by 0. The number of value combinations is 2^N = 2^4 = 16 (2 to the power N, where N is the number of key elements), written as binary strings: 0000, 0001, 0010, 0011, 0100, ..., 1111. The correspondence is as follows: !A!B!C!D=0000, !A!B!CD=0001, !A!BC!D=0010, ..., ABCD=1111.

Expressed in decimal, the above binary strings are:

!A!B!C!D=0000=0, !A!B!CD=0001=1, !A!BC!D=0010=2, ..., ABCD=1111=15.

Step 222, obtaining, from all combinations, the combination value range that makes the rule expression true, and obtaining the binary code combination of that value range;

For example, the value range that makes the original rule expression true can be extracted from all combinations, that is:
f(A, B, C, D) = (!A&!B&!C&D) | (!A&!B&C&D) | (!A&B&!C&D) | (!A&B&C&D)
| (A&!B&!C&!D) | (A&!B&!C&D) | (A&!B&C&!D) | (A&!B&C&D)

Its decimal representation is as follows:
f(ABCD)=∑(1,3,5,7,8,9,10,11)
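Steps 221 and 222 can be reproduced with a brute-force enumeration. The sketch below hard-codes the example expression F as a Python function (an illustrative shortcut; a real implementation would evaluate the parsed rule tree) and lists the binary codes, with A as the most significant bit, for which F is true.

```python
from itertools import product

def F(A, B, C, D):
    # F = (!A&C&D)|(!A&!C&D)|(A&!B&D)|(A&!B&!D)|(A&!B&C&D)
    return ((not A and C and D) or (not A and not C and D)
            or (A and not B and D) or (A and not B and not D)
            or (A and not B and C and D))

def true_codes(expr, n):
    """Return the binary codes (as strings) of all assignments making expr true."""
    codes = []
    for bits in product((0, 1), repeat=n):  # 0000, 0001, ..., 1111
        if expr(*(bool(b) for b in bits)):
            codes.append("".join(str(b) for b in bits))
    return codes

if __name__ == "__main__":
    codes = true_codes(F, 4)
    print(codes)
    print([int(c, 2) for c in codes])  # [1, 3, 5, 7, 8, 9, 10, 11]
```

The enumeration costs 2^N evaluations, so it is practical only when a single rule has a modest number of key elements.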

Step 223, performing co-position cyclic binary code merging on multiple binary codes in the binary code combination to obtain a simplified binary code combination;

In one implementation, the above step 223 may specifically include:

(1) comparing the binary codes in the binary code combination in pairs and combining them to generate a new binary code;

More specifically, when two binary codes have only one different binary bit, the different binary bit is set as the setting symbol, and the remaining same binary bits are kept unchanged as the new binary code. For example, for "0000" and "0010", there is a 1-bit difference, which can be combined into "00*0", where "*" is the setting symbol.

(2) comparing the new binary code with the original binary code that could not be merged, merging them to generate a new binary code and removing duplicate binary codes;

Repeat the above merging steps (1) and (2) until no new binary numbers can be generated.

Optionally, other same-bit cyclic binary code merging methods may also be used; this application does not impose any specific limitation in this respect.
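The merging steps (1) and (2) resemble the implicant-combining phase of the Quine-McCluskey method, and can be sketched as below. The function names are hypothetical, and "*" is used as the set symbol as in the example above. Note that on the example codes for F, merging alone leaves three codes: 0**1, 10** and *0*1. The last one (!B&D) is already covered by the other two, so arriving at a two-term result would require an additional redundant-term removal step that this sketch does not include.

```python
from itertools import combinations

def merge_pair(a, b):
    """Merge two codes that differ in exactly one concrete bit, else return None."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) != 1:
        return None
    i = diff[0]
    if a[i] == "*" or b[i] == "*":
        return None
    return a[:i] + "*" + a[i + 1:]

def merge_codes(codes):
    """Repeat pairwise merging until no new code can be generated."""
    current = set(codes)          # a set also removes duplicate codes
    result = set()
    while current:
        merged, used = set(), set()
        for a, b in combinations(sorted(current), 2):
            m = merge_pair(a, b)
            if m is not None:
                merged.add(m)
                used.update((a, b))
        result |= current - used  # codes that could not be merged are kept
        current = merged
    return result

if __name__ == "__main__":
    minterms = ["0001", "0011", "0101", "0111", "1000", "1001", "1010", "1011"]
    print(sorted(merge_codes(minterms)))
```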

Step 224: convert each binary bit in the simplified binary code combination back to the key element, and output the simplest rule expression.

In one implementation, the above step 224 may specifically include one or more of the following operations:

(1) For each binary code in the simplified binary code combination, convert it into a corresponding key element according to the position of the binary bit;

(2) performing a negation operation or not performing a negation operation on the key element according to the value of each binary bit; and

(3) If the binary code includes a binary bit whose value is the set symbol, the corresponding key element is ignored.

For example, convert "0000" into "!A!B!C!D", convert "0011" into "!A!BCD", convert "11*1" into "ABD", and so on.
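Operations (1) to (3) amount to a position-wise mapping, sketched below. The function names are hypothetical; the output joins literals with an explicit "&", whereas the examples in this description write them side by side (for instance "ABD").

```python
def code_to_term(code, elements):
    """Convert one (possibly merged) binary code back into a conjunction."""
    literals = []
    for bit, name in zip(code, elements):
        if bit == "*":
            continue                      # set symbol: element is ignored
        literals.append(name if bit == "1" else "!" + name)
    return "&".join(literals)

def codes_to_expression(codes, elements):
    """Join the terms of all codes with OR."""
    return "|".join("(" + code_to_term(c, elements) + ")" for c in codes)

if __name__ == "__main__":
    print(code_to_term("11*1", "ABCD"))                   # A&B&D
    print(codes_to_expression(["0**1", "10**"], "ABCD"))  # (!A&D)|(A&!B)
```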

Exemplarily, the merging process is as follows:

The original rule expression can be simplified equivalently:

Original rule expression: F = (!A&C&D)|(!A&!C&D)|(A&!B&D)|(A&!B&!D)|(A&!B&C&D)

The simplest rule expression: F = (!A&D)|(A&!B)
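That the conversion is lossless can be checked by comparing the two expressions over all 16 assignments of A, B, C and D. The sketch below is only a verification aid with illustrative function names, not part of the described method.

```python
from itertools import product

def original(A, B, C, D):
    return ((not A and C and D) or (not A and not C and D)
            or (A and not B and D) or (A and not B and not D)
            or (A and not B and C and D))

def simplest(A, B, C, D):
    return (not A and D) or (A and not B)

def equivalent(f, g, n):
    """True if f and g agree on every assignment of n boolean inputs."""
    return all(bool(f(*bits)) == bool(g(*bits))
               for bits in product((False, True), repeat=n))

if __name__ == "__main__":
    print(equivalent(original, simplest, 4))  # True
```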

In this embodiment, the above-mentioned simplification algorithm based on cyclic binary code is used to simplify the rule expression configured by the user, automatically remove and simplify redundant rule expression fragments, perform lossless conversion into the simplest rule expression, and generate a matching tree based on the simplified rule expression.

In this embodiment, the text rules configured by the user are losslessly converted into their simplest rule expressions, which greatly simplifies the subsequent matching process and improves the matching efficiency.

Step 230, converting the simplest rule expression into a rule expression matching tree based on a predicate calculus algorithm;

In one implementation, the above step 230 may specifically include: first obtaining a rule tree corresponding to the simplest rule expression, and then repeatedly executing one or more of the following predicate deduction operations until the tree is stable, thereby obtaining the rule expression matching tree:

(1) If a NOT operation of the rule tree has multiple child nodes, the NOT operation is pushed down to the child nodes, and the AND and OR operators are interchanged;

For example, referring to FIG4, if a NOT operation has multiple child nodes, the NOT operation is pushed down to the child nodes, the AND operation "&" and the OR operation "|" are interchanged, and brackets are added after the push-down.

(2) If the current operator of the simplest rule expression is the same as the operator of its parent node, the child nodes of the current operator are moved up and the current operator is deleted;

For example, referring to FIG5, when the current operator is the same as the parent node operator, the child nodes of the current operator are moved up, and the current operator is deleted.

(3) Leaf nodes at the same level are sorted according to their unique attributes.

For example, in node sorting, leaf nodes in the same layer are sorted by a unique attribute of the nodes, such as field ID in ascending or descending order, and non-leaf nodes in the same layer are placed after them. The unique attribute of a non-leaf node is composed of the unique attributes of its child nodes, so non-leaf nodes in the same layer are likewise sorted by unique attribute.
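The three operations above can be sketched on a small tuple-based tree. The representation is an assumption made for illustration: a leaf is a string, and an inner node is (op, [children]) with op in "and", "or", "not". The element name stands in for the unique attribute (such as a field ID), and a negated element is treated as a leaf for sorting purposes.

```python
def push_not(node, negate=False):
    """(1) Push NOT down to the leaves, swapping AND and OR (De Morgan)."""
    if isinstance(node, str):
        return ("not", [node]) if negate else node
    op, kids = node
    if op == "not":
        return push_not(kids[0], not negate)
    if negate:
        op = "or" if op == "and" else "and"
    return (op, [push_not(k, negate) for k in kids])

def flatten(node):
    """(2) Move children up when a child has the same operator as its parent."""
    if isinstance(node, str):
        return node
    op, kids = node
    out = []
    for kid in (flatten(k) for k in kids):
        if not isinstance(kid, str) and kid[0] == op and op != "not":
            out.extend(kid[1])
        else:
            out.append(kid)
    return (op, out)

def sort_key(node):
    if isinstance(node, str):
        return (0, node, 0)
    if node[0] == "not":
        return (0, node[1][0], 1)
    return (1, repr(node), 0)   # non-leaf: keyed by its (sorted) children

def sort_tree(node):
    """(3) Sort each level: leaves first by name, non-leaf nodes after."""
    if isinstance(node, str) or node[0] == "not":
        return node
    op, kids = node
    return (op, sorted((sort_tree(k) for k in kids), key=sort_key))

def normalize(tree):
    """Apply the three operations repeatedly until the tree is stable."""
    while True:
        new = sort_tree(flatten(push_not(tree)))
        if new == tree:
            return new
        tree = new

if __name__ == "__main__":
    print(normalize(("not", [("and", ["B", ("or", ["C", "A"])])])))
```

Two rules that are logically written in different orders then normalize to the same tree, which is what makes fragment comparison in the next step straightforward.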

Step 240, merging multiple rule expression matching trees into a merged matching network, and identifying common rule fragments;

In one implementation, the above step 240 may further include:

(1) selecting one rule expression matching tree, transposing it top to bottom, and taking the rule identifier of that rule expression matching tree as the root node to form the initial state of the merged matching network;

For example, referring to Figure 6, select a rule network and transpose it up and down, with the original leaf node as the entry point and the final matched rule as the end point. This network is the initial state of the merged network. After the up and down transposition, the element node is at the top layer.

(2) traversing the other rule expression matching trees one by one, transposing each top to bottom, and integrating them one by one into the merged matching network;

Furthermore, integrating them one by one into the merged matching network also includes: for the element nodes in a single rule expression matching tree, adding or reusing the element nodes in the merged matching network; and/or, for the logic symbol nodes in a single rule expression matching tree, adding or reusing the logic symbol nodes in the merged matching network; and/or, for completely overlapping logic symbol nodes, extracting the common rule fragment and its corresponding rule expression matching trees through reverse search (for example, see FIG7, where "Condition 5 & Condition 6" is a common fragment shared by Rule 1 and Rule 2); and/or, for partially overlapping logic symbol nodes, splitting the logic symbol nodes in the merged matching network (for example, see FIG8).

(3) After the traversal is completed, a complete merged matching network is formed, and the common rule fragments and the rules to which they belong are extracted, while the single element or derived element set of each rule is recorded.

For example, the trees of FIG. 9 are merged to generate the merged matching network shown in FIG. 10, in which the elements in the dashed box are common rule fragments, and the element set involved in each rule identifier (Rule 1 and Rule 2) is recorded at the same time.
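By way of illustration only, the identification of completely overlapping logic symbol nodes can be sketched as follows. The sketch substitutes a shared node table keyed by a canonical form for the up-down transposition described above, and it does not split partially overlapping logic symbol nodes. The function name `merge_trees`, the tuple representation of a tree and the condition names `c1`, `c2`, `c5` and `c6` are assumptions introduced for this sketch.

```python
from collections import defaultdict


def merge_trees(rules):
    """Merge rule trees into one shared node table.

    rules: dict mapping rule id -> tree, where a tree is either a condition
    name (str, an element node) or a tuple (op, child, child, ...).
    """
    owners = defaultdict(set)    # node key -> ids of the rules that use the node
    elements = defaultdict(set)  # rule id -> element set of that rule

    def add(tree, rule_id):
        if isinstance(tree, str):      # element node: add or reuse
            key = tree
            elements[rule_id].add(tree)
        else:                          # logic symbol node: add or reuse
            op, *children = tree
            # Sorting the child keys makes the key independent of child order.
            child_keys = sorted((add(c, rule_id) for c in children), key=repr)
            key = (op, tuple(child_keys))
        owners[key].add(rule_id)
        return key

    roots = {rule_id: add(tree, rule_id) for rule_id, tree in rules.items()}
    # A logic symbol node used by more than one rule is a common rule fragment.
    common = {key: sorted(ids) for key, ids in owners.items()
              if not isinstance(key, str) and len(ids) > 1}
    return roots, common, dict(elements)


rules = {
    "rule1": ("OR", "c1", ("AND", "c5", "c6")),
    "rule2": ("OR", ("AND", "c6", "c5"), "c2"),
}
roots, common, elements = merge_trees(rules)
print(common)  # {('AND', ('c5', 'c6')): ['rule1', 'rule2']}
```

In this sketch the mapping from each common fragment back to the rules that use it plays the role of the reverse search, and `elements` corresponds to the element set recorded for each rule identifier.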

Step 250: perform feature matching on the to-be-matched data using the merged matching network and the common rule fragments.

In one implementation, each rule identifier in the merged matching network may be prioritized in advance; and the element set involved in each rule identifier in the merged matching network is matched with the data to be matched in sequence according to the priority until the match succeeds or ends.

In one embodiment, using the merged matching network and the common rule fragment to perform feature matching on the to-be-matched data includes:

(1) Element matching: for each rule identifier, take the element node set of the corresponding rule expression matching tree, enter the merged matching network from its entry point, match the element set against the data to be matched, and cache the element matching results.

(2) Logical matching: if a logical node is reached, query the cache to see whether its parent node has been hit. There are two cases: i) if there is no cached result, take the next element node from the element node set and match it against the data to be matched; ii) if there is a cached result, use the cached result directly for the logical operation. If the logical node belongs to a common rule fragment, cache the logical matching result; otherwise, the result is not cached, in order to save space.

(3) If the final node is matched, return the matched rule identifier; if the rule is not satisfied, match the next rule in priority order.
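By way of illustration only, the three matching steps above can be sketched as follows. The sketch evaluates each rule tree recursively rather than walking a transposed network. The function name `match_rules`, the tuple representation of a tree, and the representation of each condition as a callable taking the data to be matched are assumptions introduced for this sketch.

```python
def match_rules(rules, common, conditions, data):
    """Return the id of the first rule, in priority order, that matches data.

    rules: list of (rule id, tree) already sorted by priority; a tree is a
        condition name (str) or a tuple (op, child, ...) with op "AND" or "OR".
    common: set of subtrees that are common rule fragments.
    conditions: dict mapping condition name -> callable(data) -> bool.
    """
    element_cache = {}   # element matching results are always cached
    fragment_cache = {}  # logical results are cached for common fragments only

    def evaluate(tree):
        if isinstance(tree, str):                  # (1) element matching
            if tree not in element_cache:
                element_cache[tree] = conditions[tree](data)
            return element_cache[tree]
        if tree in fragment_cache:                 # (2) reuse a cached result
            return fragment_cache[tree]
        op, *children = tree
        results = (evaluate(child) for child in children)
        value = all(results) if op == "AND" else any(results)
        if tree in common:                         # cache only shared fragments
            fragment_cache[tree] = value
        return value

    for rule_id, tree in rules:                    # (3) priority order
        if evaluate(tree):
            return rule_id
    return None
```

For example, with a fragment `("OR", "c5", "c6")` shared by two rules, the fragment is evaluated once for the first rule, and the second rule reads the cached result instead of evaluating `c5` and `c6` again.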

It should be noted that, for steps not described in detail in this embodiment, reference may be made to the description of the relevant steps in the embodiment shown in FIG. 1, which will not be repeated here.

In the description of this specification, reference to the terms "some possible embodiments", "some embodiments", "examples", "specific examples", or "some examples" means that the specific features, structures, materials or characteristics described in conjunction with the embodiment or example are included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any one or more embodiments or examples in a suitable manner. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of the different embodiments or examples, unless they are contradictory.

With regard to the method flowcharts of the embodiments of the present application, some operations are described as distinct steps performed in a certain order. Such flowcharts are illustrative and not restrictive. Some steps described herein can be grouped together and performed in a single operation, some steps can be divided into multiple sub-steps, and some steps can be performed in an order different from that shown herein. Each step shown in a flowchart can be implemented in any manner by any circuit structure and/or tangible mechanism (for example, by software running on computer equipment, by hardware (for example, logic functions implemented by a processor or chip), and/or by any combination thereof).

It should be noted that the device in the implementation mode of the present application can implement each process of the implementation mode of the aforementioned method and achieve the same effects and functions, which will not be repeated here.

According to some embodiments of the present application, a non-volatile computer storage medium for the rule expression matching method is provided, on which computer-executable instructions are stored, the computer-executable instructions being configured to perform the method described in the above embodiments when executed by a processor.

Each embodiment in this application is described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the device, equipment, and computer-readable storage medium embodiments are basically similar to the method embodiments, their descriptions are simplified, and for the relevant parts reference may be made to the corresponding description of the method embodiments.

The apparatus, equipment and computer-readable storage medium provided in the embodiments of the present application correspond one-to-one to the method. Therefore, the apparatus, equipment and computer-readable storage medium also have similar beneficial technical effects as the corresponding methods. Since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the apparatus, equipment and computer-readable storage medium will not be repeated here.

It will be appreciated by those skilled in the art that the embodiments of the present invention may be provided as methods, devices (equipment or system), or computer-readable storage media. Therefore, the present invention may be implemented in the form of a complete hardware implementation, a complete software implementation, or an implementation combining software and hardware. Moreover, the present invention may be implemented in the form of a computer-readable storage medium implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.

The present invention is described with reference to the flowchart and/or block diagram of the method, device (equipment or system) and computer-readable storage medium according to the embodiment of the present invention. It should be understood that each process and/or box in the flowchart and/or block diagram, and the combination of the process and/or box in the flowchart and/or block diagram can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the function specified in one process or multiple processes in the flowchart and/or one box or multiple boxes in the block diagram.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product including an instruction device that implements the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device so that a series of operational steps are executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.

In a typical configuration, a computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.

Memory may include non-permanent storage in a computer-readable medium, in the form of random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media that can implement information storage by any method or technology. Information can be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk read-only memory (CD-ROM), digital versatile disk (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that may be used to store information that can be accessed by a computing device.

In addition, although the operations of the method of the present invention are described in a particular order in the accompanying drawings, this does not require or imply that the operations must be performed in this particular order or that all of the operations shown must be performed to achieve the desired results. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step, and/or one step may be broken down into multiple steps.

Although the spirit and principle of the present invention have been described with reference to several specific embodiments, it should be understood that the present invention is not limited to the disclosed specific embodiments, and the division into aspects does not mean that the features in these aspects cannot be combined to advantage; such division is only for convenience of expression. The present invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims

A regular expression matching method, characterized by comprising:

receiving a rule text string, performing syntax check on the rule text string, and outputting a rule expression;

losslessly converting the rule expression into a simplest rule expression based on a simplification algorithm of cyclic binary codes;

equivalently converting the simplest rule expression into a rule expression matching tree based on a predicate calculus algorithm;

merging a plurality of the rule expression matching trees into a merged matching network, and identifying common rule fragments;

performing feature matching on data to be matched using the merged matching network and the common rule fragments.
The method according to claim 1, characterized in that losslessly converting the rule expression into the simplest rule expression based on the simplification algorithm of cyclic binary codes further comprises:

Obtain all key elements in the regular expression, and generate all combinations of the multiple key elements based on the positive and negative values of each key element;

Obtain a combination value range that makes the regular expression true from all combinations, and obtain a binary code combination of the combination value range;

Performing co-position cyclic binary code merging on multiple binary codes in the binary code combination to obtain a simplified binary code combination;

Each binary bit in the simplified binary code combination is converted back to the key element, and the simplest regular expression is output.
The method according to claim 2, characterized in that the step of performing co-position cyclic binary code merging on the plurality of binary codes comprises:

Comparing the binary codes in the binary code combination in pairs, and combining them to generate a new binary code;

Compare the new binary code with the original binary code that cannot be merged, merge them to generate a new binary code and remove duplicate binary codes;

Repeating the above merging steps until no new binary codes can be generated.
The method according to claim 2, characterized in that the co-position cyclic binary code merging further comprises:

When there is only one different binary bit between two binary codes, the different binary bit is set as a set symbol, and the other identical binary bits are kept unchanged as a new binary code.
The method according to claim 3, characterized in that converting each binary bit in the simplified binary code combination back to the key element comprises:

For each binary code in the simplified binary code combination, convert it into a corresponding key element according to the position of the binary bit;

Performing a negation operation or not negating the key element according to the value of each binary bit; and

If the binary code includes a binary bit whose value is the set symbol, the corresponding key element is ignored.
The method according to claim 1, characterized in that performing the syntax check on the rule text string further comprises: using a context-free grammar and a recursive descent algorithm to perform a completeness syntax check on the rule text string.
The method according to claim 1, characterized in that performing the syntax check on the rule text string further comprises:

Reading the rule text string, and dividing the rule text string according to a predetermined separator to obtain a plurality of morphemes;

Arranging the morphemes in the order in which they appear in the rule text string to generate a lexical unit sequence;

Traversing the lexical unit sequence to perform the syntax check on the rule text string.
The method according to claim 1, characterized in that the morphemes are divided into key element types and logical operation types.
The method according to claim 1, characterized in that equivalently converting the simplest rule expression into the rule expression matching tree further comprises:

Repeating one or more of the following predicate calculus operations until the result is stable, to obtain the rule expression matching tree:

Obtaining a rule tree corresponding to the simplest rule expression;

If a NOT operation in the rule tree has multiple child nodes, pushing the NOT operation down to the child nodes and interchanging the AND operator and the OR operator;

If the current operator is the same as the operator of its parent node, moving the child nodes of the current operator up and deleting the current operator;

For leaf nodes at the same level, sorting them according to their unique attributes.
The method according to claim 1, characterized in that merging a plurality of the rule expression matching trees into a merged matching network and identifying common rule fragments comprises:

Selecting one rule expression matching tree, performing up-down transposition on it, and marking the root node of the rule expression matching tree with its rule identifier, the result serving as the initial state of the merged matching network;

Traversing the other rule expression matching trees one by one, performing up-down transposition on each, and integrating them one by one into the merged matching network;

After the traversal is completed, a complete merged matching network is formed and common rule fragments are extracted.
The method according to claim 1, characterized in that integrating the trees one by one into the merged matching network further comprises:

For the element nodes in a single regular expression matching tree, adding or reusing the element nodes in the merged matching network; and/or,

For logic symbol nodes in a single regular expression matching tree, adding or reusing logic symbol nodes in the merged matching network; and/or,

For completely overlapping logic symbol nodes, the common rule fragment and its corresponding rule expression matching tree can be extracted through reverse search; and/or,

For partially overlapping logic symbol nodes, the logic symbol nodes in the merged matching network are split.
The method according to claim 1, characterized in that the step of performing feature matching on the to-be-matched data using the merged matching network and the common rule fragments comprises:

Prioritizing each rule identifier in the merged matching network in advance;

The element set involved in each rule identifier in the merged matching network is matched with the to-be-matched data in sequence according to the priority until the match succeeds or ends.
The method according to claim 1, characterized in that the feature matching of the to-be-matched data using the merged matching network and the common rule fragment comprises one or more of the following operations:

Starting from the entrance of the merged matching network, the data to be matched is matched against the set of element nodes of the rule expression matching tree involved in each rule identifier; if an element node of the rule expression matching tree is matched, the element matching result is cached.
The method according to claim 1, characterized in that performing feature matching on the to-be-matched data using the merged matching network and the common rule fragments further comprises:

If a logical node of the rule expression matching tree is matched, querying the cache to see whether the parent element node of the logical node has been hit, wherein:

If there is no cached result, the next element node is taken from the element node set and matched with the data to be matched;

If there is a cached result, the cached result is taken directly for the logical operation; and if the logical node belongs to a common rule fragment, the logical matching result is cached.
The method according to claim 1, characterized in that the feature matching of the to-be-matched data is performed using the merged matching network and the common rule fragment, further comprising:

If a rule identifier node of the rule expression matching tree is matched, the matching rule identifier is returned.
A rule expression matching device, characterized in that it is configured to perform the method according to any one of claims 1 to 15, and comprises:

A grammar checker, used for receiving a rule text string, performing a syntax check on the rule text string, and outputting a rule expression;

A feature converter, used for losslessly converting the rule expression into a simplest rule expression based on a simplification algorithm of cyclic binary codes;

A predicate calculator, used for equivalently converting the simplest rule expression into a rule expression matching tree based on a predicate calculus algorithm;

A network merger, used for merging a plurality of the rule expression matching trees into a merged matching network and identifying common rule fragments;

A feature matcher, used for performing feature matching on data to be matched using the merged matching network and the common rule fragments.
A regular expression matching device, characterized by comprising:

At least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor so as to enable the at least one processor to perform the method according to any one of claims 1 to 15.
A computer-readable storage medium, characterized in that the computer-readable storage medium stores a program, and when the program is executed by a multi-core processor, the multi-core processor executes the method as described in any one of claims 1-15.