CN113627164B - Method and system for identifying state explosion type regular expression - Google Patents

Method and system for identifying state explosion type regular expression

Info

Publication number
CN113627164B
CN113627164B (application CN202110784458.0A)
Authority
CN
China
Prior art keywords
node
graph
nfa
regular expression
state explosion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110784458.0A
Other languages
Chinese (zh)
Other versions
CN113627164A (en)
Inventor
卢毓海
王晓琳
张春燕
刘燕兵
谭建龙
郭莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202110784458.0A priority Critical patent/CN113627164B/en
Publication of CN113627164A publication Critical patent/CN113627164A/en
Application granted granted Critical
Publication of CN113627164B publication Critical patent/CN113627164B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method and a system for identifying state explosion type regular expressions. The method comprises the following steps: 1) for a regular expression to be identified, generating the NFA graph corresponding to the regular expression, to obtain an NFA graph set corresponding to the regular expression; 2) for each NFA graph in the NFA graph set, extracting all root subgraphs in the NFA graph, inputting the root subgraphs into a graph2vec model, and training to obtain an embedded representation of the NFA graph; 3) processing the embedded representation of the NFA graph with a classification model to determine whether the regular expression is a state explosion type regular expression. The method can process regular expressions in batches efficiently and rapidly, and meets an online system's requirements for high processing performance and low space consumption.

Description

Method and system for identifying state explosion type regular expression
Technical Field
The invention relates to a method and a system for identifying a state explosion type regular expression, and belongs to the technical field of computer software.
Background
Regular expression matching is a key component of many applications such as network filtering, for example Deep Packet Inspection (DPI), which can enhance the security of network communications and detect malicious traffic. To perform regular expression matching, the regular expression is first converted into a finite automaton (FA). A finite automaton is a state machine that recognizes the same language as the regular expression; depending on whether the next state transition is uniquely determined, it is classified as a non-deterministic finite automaton (NFA) or a deterministic finite automaton (DFA). NFA and DFA have the same expressive power, but since each DFA state corresponds to a set of NFA states, the conversion from NFA to DFA may lead to a sharp increase in the number of states, a phenomenon known as state explosion. Table 1 compares the space complexity and matching time complexity of NFA and DFA under different compilation strategies.
Table 1 Comparison of the space complexity and matching time complexity of NFA and DFA under different compilation strategies
DFA is widely used in DPI applications because of its efficient matching performance, but the state explosion of DFA poses a significant challenge to its practical use. The existing technique for identifying whether a regular expression will produce a state-exploding DFA simply sets a threshold on the DFA state count: if the threshold is exceeded, the regular expression is judged to be of the state explosion type, otherwise it is not. Specifically, a threshold is set in the process of generating the DFA from the regular expression: the regular expression is first parsed into a parse tree, the parse tree is converted into an NFA with the Thompson or Glushkov construction, and the NFA is finally converted into a DFA with the subset construction; if the number of generated DFA states exceeds the threshold, the DFA is judged to have a state explosion and the corresponding regular expression is of the explosion type, otherwise it is of the non-explosion type.
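For concreteness, the prior-art threshold check can be sketched as follows. This is a minimal sketch, assuming an NFA encoded as a dictionary of transitions (with None for epsilon moves); the function names and the threshold value are illustrative assumptions, not the exact procedure of any existing tool.

```python
# Sketch of the threshold-based prior art: run the subset construction on an
# NFA and flag the regular expression as "state explosion type" once the
# number of DFA states exceeds a threshold. The NFA encoding is assumed:
# node -> {character or None (epsilon): set of successor nodes}.
from collections import deque

def epsilon_closure(states, nfa):
    """All NFA states reachable from `states` via epsilon (None) transitions."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get(s, {}).get(None, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def dfa_state_count(nfa, start, alphabet, threshold=10000):
    """Subset construction; returns (state_count, exploded)."""
    start_set = epsilon_closure({start}, nfa)
    seen, queue = {start_set}, deque([start_set])
    while queue:
        current = queue.popleft()
        for ch in alphabet:
            move = set()
            for s in current:
                move.update(nfa.get(s, {}).get(ch, ()))
            if not move:
                continue
            nxt = epsilon_closure(move, nfa)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
                if len(seen) > threshold:
                    return len(seen), True   # judged to be state-explosive
    return len(seen), False
```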
The existing technical scheme mainly identifies state explosion type regular expressions by setting a threshold, and this technique has the following defects:
1. Low practicability: to judge whether a regular expression is of the state explosion type, the complete process of generating the DFA from the regular expression must be executed, which is operationally complex; the algorithm is slow and not easy to carry out.
2. High space complexity: if the regular expression is of the state explosion type, a large number of DFA states are generated when converting the NFA into a DFA, and a large amount of buffer space is required to record these states, so the space cost is extremely high.
3. The specific structures that cause state explosion cannot be identified: some regular expressions have DFA state counts that do not reach the threshold, so the technique does not flag them, yet they contain specific state-explosion structures and should be classified as state-explosive.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention provides a method for identifying state explosion type regular expressions that can process regular expressions in batches efficiently and rapidly and meets an online system's requirements for high processing performance and low space consumption. The main idea of the invention is to automatically learn the structural characteristics of the NFA graph (generated from the regular expression by the Thompson construction) with a graph neural network (GNN) model, embed the high-dimensional graph representation into a low-dimensional vector space, and then perform binary classification on the vectorized graph representations with a classification model: whether the regular expression corresponding to the NFA is of the state explosion type. The details are described in the technical scheme below.
The technical scheme of the invention is as follows:
A method for identifying a state explosion type regular expression comprises the following steps:
1) for a regular expression to be identified, generating the NFA graph corresponding to the regular expression, to obtain an NFA graph set corresponding to the regular expression;
2) for each NFA graph in the NFA graph set, extracting all root subgraphs in the NFA graph, inputting the root subgraphs into a graph2vec model, and training to obtain an embedded representation of the NFA graph;
3) processing the embedded representation of the NFA graph with a classification model to judge whether the regular expression is a state explosion type regular expression.
Further, in step 1), the NFA graph corresponding to the regular expression is generated by the Thompson construction, and the information of the NFA graph is stored in a gexf file. The information of the NFA graph includes node information and edge information, where the node information includes the id value of each node, and the edge information includes the start node, the end node, and the transition character on the edge for each edge.
Further, according to its out-edge character set, each node is classified into one of three types, used as the node's feature information: nodes that accelerate state explosion, nodes that block state explosion, and common nodes. A node that accelerates state explosion: if the out-edge character set of node n overlaps the prefix of the regular expression and this character set appears k times in succession, node n is marked as a node that accelerates state explosion. A node that blocks state explosion: if there is a node n that accelerates state explosion and a node m before node n whose out-edge character set has an empty intersection with the out-edge character set of node n, node m is marked as a node that blocks state explosion. All remaining nodes, other than those that accelerate or block state explosion, are marked as common nodes.
Further, the feature information of the nodes is added into the WL algorithm, and all root subgraphs in the gexf file are then extracted with this WL algorithm fused with node classification.
Further, the classification model is an SVM classifier.
Further, the graph2vec model is trained as follows: all root subgraphs in the NFA graphs are treated as the vocabulary of a document; the vectorized representation of the document is obtained by maximizing $\sum_{j}\log \Pr(w_j \mid d_i)$, and the optimization objective is to maximize the probability of each graph's own root subgraphs appearing in it, i.e. to maximize $\sum_{n}\log \Pr\big(sg_n^{(d)} \mid \Phi(G)\big)$, where $d_i$ denotes the vectorized representation of the i-th document, $w_j$ denotes the embedded representation of the j-th word of the vocabulary appearing in the i-th document, $\Pr(w_j \mid d_i)$ denotes the conditional probability of the word $w_j$ occurring in document $d_i$, $\Phi(G)$ is the vectorized representation of the graph G whose embedded representation is sought, and $sg_n^{(d)}$ denotes the vectorized representation of the d-degree root subgraph rooted at node n in graph G.
A recognition system for state explosion type regular expressions comprises an NFA graph conversion module, a root subgraph extraction module, an NFA graph embedded-representation generation module, and a classification model, wherein:
the NFA graph conversion module is used to generate the NFA graph corresponding to the regular expression, obtaining the NFA graph set corresponding to the regular expression;
the root subgraph extraction module is used to extract all root subgraphs in each NFA graph;
the NFA graph embedded-representation generation module is used to train on all root subgraphs of the NFA graph with a graph2vec model, obtaining the embedded representation of the NFA graph;
the classification model is used to judge, from the embedded representation of the NFA graph of the regular expression, whether the regular expression is a state explosion type regular expression.
In the invention, the graph2vec model is selected as the basic model and, combined with the characteristics of NFA graphs, the edge information of the NFA is added to optimize the model input, achieving more accurate classification. The key technical scheme of the invention is described as follows:
1. Introduction to the graph2vec model
The selected basic model is graph2vec, a graph neural network framework for unsupervised representation learning. It learns embedded representations of undirected, directed, and weighted graphs, which can serve as input data for subsequent tasks such as graph classification. The specific design of the graph2vec model is as follows:
(1) Extracting root subgraphs: first, the WL algorithm proposed by Shervashidze et al. (Algorithm 1) is used to extract the d-degree root subgraph of each node in the NFA graphs and assign a unique label to every root subgraph, so that each NFA graph can be regarded as composed of a set of root subgraphs; if each graph is regarded as a document, all the root subgraphs in the graph correspond to the words that make up the document, as shown in FIG. 1.
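The WL relabeling that produces the d-degree rooted subgraph labels can be sketched along the following lines; this is a minimal sketch, assuming the graph is given as an adjacency dictionary of out-neighbors, and the label-compression bookkeeping is an illustrative simplification of the algorithm of Shervashidze et al.

```python
# Sketch of WL relabeling: each round aggregates a node's label with the sorted
# labels of its out-neighbors and compresses the result to a new label, so the
# label after round d identifies the d-degree rooted subgraph of that node.
def wl_root_subgraphs(adj, init_labels, d):
    """adj: node -> list of out-neighbors; init_labels: node -> initial label.
    Returns node -> list of rooted-subgraph labels for rounds 0..d."""
    labels = dict(init_labels)
    history = {n: [labels[n]] for n in adj}
    compress = {}                           # signature -> compressed label id
    for _ in range(d):
        new_labels = {}
        for n in adj:
            signature = (labels[n], tuple(sorted(labels[m] for m in adj[n])))
            if signature not in compress:
                compress[signature] = len(compress)
            new_labels[n] = compress[signature]
        labels = new_labels
        for n in adj:
            history[n].append(labels[n])
    return history
```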
(2) Training: the root subgraphs of all graphs obtained by the WL algorithm can be regarded as the vocabulary of a document, and the idea of the doc2vec model is used: the training goal of the model is to maximize the conditional probability of words that appear in a document and minimize the conditional probability of words that do not appear in it, i.e. the vectorized representation of each document is obtained by maximizing equation (1), $\sum_{j}\log \Pr(w_j \mid d_i)$; the more identical words two documents share, the closer their representations in the low-dimensional vector space should be. Here $d_i$ denotes the vectorized representation of the i-th document, $w_j$ denotes the embedded representation of the j-th word of the vocabulary appearing in the i-th document, and $\Pr(w_j \mid d_i)$ denotes the conditional probability of the word $w_j$ occurring in document $d_i$.
If two graphs consist of more similar root subgraphs, the graph2vec model likewise expects their low-dimensional embedded representations to be closer, so its optimization objective is to maximize the probability of each graph's own root subgraphs appearing in it, as in equation (2), $\max_{\Phi}\sum_{n}\log \Pr\big(sg_n^{(d)} \mid \Phi(G)\big)$, where $\Phi(G)$ is the vectorized representation of the graph G (the NFA graph generated from a regular expression) whose embedded representation is sought, and $sg_n^{(d)}$ denotes the vectorized representation of the d-degree root subgraph rooted at node n in graph G.
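For completeness, the conditional probabilities in equations (1) and (2) are usually taken to be the softmax used by doc2vec and skip-gram; the text does not write this expansion out, so the following form is an assumption based on those standard models, with V the vocabulary of words and S the set of all root subgraph labels:

```latex
% Assumed doc2vec / skip-gram softmax for the conditional probabilities:
\Pr(w_j \mid d_i) = \frac{\exp(\vec d_i \cdot \vec w_j)}{\sum_{w \in V} \exp(\vec d_i \cdot \vec w)},
\qquad
\Pr\!\left(sg_n^{(d)} \mid \Phi(G)\right) =
  \frac{\exp\!\left(\Phi(G) \cdot sg_n^{(d)}\right)}{\sum_{sg \in S} \exp\!\left(\Phi(G) \cdot sg\right)}
```

In practice the denominators are approximated with negative sampling, as noted below.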
In the model training process, the idea of the skip-gram model is used when maximizing equation (2), and the final graph embedded representation is obtained with a stochastic gradient descent optimization algorithm and the negative sampling technique.
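A minimal training sketch following this description treats each NFA graph as a document of rooted-subgraph labels and trains gensim's Doc2Vec in PV-DBOW (skip-gram style) mode with negative sampling. Using gensim, and the specific parameter values, are assumptions for illustration; the text only specifies the skip-gram idea, stochastic gradient descent and negative sampling.

```python
# Sketch: graph2vec-style embedding by treating rooted-subgraph labels as words.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_graph_embeddings(graph_subgraphs, dim=1024, epochs=50):
    """graph_subgraphs: dict graph_id -> list of rooted-subgraph labels (strings)."""
    corpus = [TaggedDocument(words=subs, tags=[str(gid)])
              for gid, subs in graph_subgraphs.items()]
    model = Doc2Vec(corpus, vector_size=dim, dm=0,   # dm=0: PV-DBOW (skip-gram style)
                    negative=5, min_count=1, epochs=epochs)
    return {gid: model.dv[str(gid)] for gid in graph_subgraphs}
```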
(3) Use: after the embedded representation of each graph is obtained, it can be used for subsequent classification or clustering tasks; for example, the graphs can be binary-classified with an SVM, or clustered with the K-means clustering method.
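Both downstream uses can be sketched with scikit-learn as follows; the kernel and other settings are assumptions, and the labels follow the +1 (explosive) / -1 (non-explosive) convention used later in the description.

```python
# Sketch: binary classification of graph embeddings with an SVM, or clustering
# them with K-means, using scikit-learn.
from sklearn.svm import SVC
from sklearn.cluster import KMeans

def classify_embeddings(train_X, train_y, test_X):
    clf = SVC(kernel="rbf")              # kernel choice is an assumption
    clf.fit(train_X, train_y)            # train_y values in {+1, -1}
    return clf.predict(test_X)

def cluster_embeddings(X, k=2):
    return KMeans(n_clusters=k).fit_predict(X)
```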
2. Optimizing model data input
The invention predicts the output label of NFA graph samples, namely of the NFA graph set generated from the regular expression set to be processed. Note that an NFA graph has not only node and edge structure but also transition information on its edges, which carries most of the state-explosion information, so processing of edge information is added on top of the graph2vec model. The specific design is to classify each node into one of three types by analyzing its out-edge character set and to use the type as the node's feature information: nodes that accelerate state explosion, nodes that block state explosion, and common nodes.
(1) Nodes that accelerate state explosion: if the out-edge character set of node n overlaps the prefix of the regular expression and the character set appears k times in succession (k > 1), node n causes the DFA state count corresponding to the regular expression to grow sharply, and node n is marked as a node that accelerates state explosion.
(2) Nodes that block state explosion: if there is a node n that accelerates state explosion as above, and a node m before node n whose out-edge character set has an empty intersection with that of node n, such a node m causes the DFA state count corresponding to the regular expression to grow slowly, and node m is marked as a node that blocks state explosion.
(3) Common nodes: all nodes except the two types above are marked as common nodes.
After the nodes are classified into these three types, the node-classification feature information can be added into the WL algorithm, so that the extracted root subgraphs contain not only the structural information of the graph but also the edge information; the graph2vec model is then trained as before to obtain the embedded representation of the NFA graph.
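A minimal sketch of this three-way node labeling is given below; the graph encoding and the way the k consecutive repetitions are detected are illustrative assumptions, since the text states the rules only at the level given above.

```python
# Sketch of edge-driven node features: 1 = accelerates state explosion,
# 2 = blocks state explosion, 0 = common node.
def label_nodes(out_charsets, predecessors, run_length, prefix_chars, k=2):
    """out_charsets: node -> set of out-edge characters;
    predecessors: node -> list of nodes immediately before it in the NFA;
    run_length: node -> number of consecutive repetitions of its out-edge set;
    prefix_chars: set of characters in the regular expression's prefix."""
    labels = {n: 0 for n in out_charsets}                    # default: common node
    for n, chars in out_charsets.items():
        # overlaps the prefix and repeats k (or more) times in succession
        if (chars & prefix_chars) and run_length.get(n, 1) >= k:
            labels[n] = 1                                    # accelerates explosion
    for n, label in list(labels.items()):
        if label != 1:
            continue
        for m in predecessors.get(n, []):
            if not (out_charsets[m] & out_charsets[n]):
                labels[m] = 2                                # blocks explosion
    return labels
```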
Compared with the prior art, the invention has the following positive effects:
Note that an NFA is a graph consisting of edges and nodes, and certain special structures cause the states of the generated DFA to explode. Therefore, to binary-classify regular expressions, we convert them into graph structures, use a graph neural network (GNN) to handle the graph-related task end to end, and extract features into an embedded representation, obtaining a low-dimensional embedded representation of each NFA graph in which graphs with similar structures also have similar representations in the vector space; this is what allows the regular expressions to be classified by the features of their graph structure.
In implementing the model, node features are added according to the edge information, and a complete NFA graph is decomposed into a representation by several subgraphs, so that graphs with similar substructures are more likely to receive the same classification result. These two key points are described below.
Adding edge information:
The NFA graph corresponding to a regular expression is generated by the Thompson construction, and the information on the edges of the graph determines whether the generated DFA will explode. Even if two NFA graphs are isomorphic, they can still have different labels if the transition information on corresponding edges differs, so edge information must be added into the model; how the edge information is added is important, and it must fully reflect the characteristics of state explosion type regular expressions. After study of the related literature and experimental verification, the model adopts the approach of classifying each node into one of three types by analyzing its out-edge information and adding the type to the WL algorithm as the node's feature information, so that the extracted root subgraphs contain the feature information of the edges that cause state explosion.
Extracting root subgraphs:
Note that if two NFA graphs are composed of more identical subgraphs (including the information on edges), they will, with high probability, have the same label, and experiments verify that the NFAs corresponding to most state explosion type regular expressions have similar substructures. Therefore, during model training the WL algorithm is used to extract the root subgraphs of all graphs, and each graph is regarded as composed of a set of root subgraphs; the resulting graph embedding represents more subgraph information, so NFAs with similar subgraphs are also closer in the embedding space.
Specifically, for a regular expression (from a rule library such as Snort or L7), the process of judging whether its DFA will produce a state explosion is as follows: first, the NFA graph corresponding to the regular expression is generated by the Thompson construction and the graph information is stored in a gexf file; all root subgraphs of the gexf graph are then extracted with the WL algorithm fused with node classification; the graph2vec model is trained to obtain the embedded representation of the graph; finally, the embedded representation of the graph is processed with an SVM classifier to obtain the final label: 1 (explosive) or -1 (non-explosive).
The hardware configuration of the experiment of the invention is shown in Table 2:
table 2 hardware configuration of experiments
Operating system: CentOS Linux release 7.5.1804
Memory: 125 GB
CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20 GHz
Experiment design: 3520 regular expressions are selected from the rule libraries clamav_regex_strings_2015, l7_rules_2015 and snort_regex_rules_2015, converted into their corresponding NFA graphs by the Thompson construction, and the embedded representation of each graph is obtained by training the model. For the SVM binary classification, since the SVM model is supervised learning, each regular expression is labeled with the threshold method; of the 3520 regular expressions, 1570 turn out to be of the explosion type (label 1) and 1720 of the non-explosion type (label -1). During training, 70% (3168) of the NFA embedded representations are randomly selected as the training set and the remaining 30% (352) as the test set. d = 3, 5, 8 is set respectively, and the experimental results are shown in Table 3:
table 3 comparison of experimental results
The experimental results show that the highest classification accuracy, 98%, is achieved with d = 5, i.e. when 5-hop neighborhood information is aggregated.
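For reference, the evaluation protocol described above (random split of the NFA embeddings, SVM training, accuracy on the held-out set) can be reproduced along these lines with scikit-learn; the split ratio follows the stated 70%/30%, and anything not stated in the text is an assumption.

```python
# Sketch of the experiment: split the embeddings, train an SVM, report accuracy.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def evaluate(embeddings, labels, test_size=0.3, seed=0):
    """embeddings: (n_samples, dim) array; labels: +1 explosive / -1 not."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=test_size, random_state=seed)
    clf = SVC().fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```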
Drawings
FIG. 1 compares the ideas of the Graph2vec model and the Doc2vec model;
(a) Doc2vec model idea, (b) Graph2vec model idea;
FIG. 2 is a flow chart of the method of the present invention;
fig. 3 shows an SVM classification example.
Detailed Description
Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
The original input of the invention is a set of regular expressions, such as the regular expression shown in Table 4: "TRA[^A]{3}HL". As shown in FIG. 2, the process of predicting whether a regular expression is of the state explosion type is divided into the following stages:
(1) Generating the NFA from the regular expression: first, the NFA graph structure corresponding to each regular expression is generated with the Thompson construction; Table 4 shows the NFA graph corresponding to the regular expression "TRA[^A]{3}HL".
Table 4 An example of the WL algorithm execution process
(2) Converting the NFA into standard graph-format data: each regular expression first generates its corresponding NFA graph by the Thompson construction, and the NFA graph information is then recorded in a gexf format file. For example, the gexf file storing the NFA graph structure of Table 4 for the regular expression "TRA[^A]{3}HL" comprises the node information and edge information of the NFA graph, where the node information includes the id value of each node, and the edge information includes the start node (source) and end node (target) of each edge and the transition character of the edge (i.e. the out-edge character, with weight represented by its ASCII code value).
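A minimal sketch of producing such a gexf file with networkx is shown below; the helper name and the exact attribute layout are assumptions, though storing the transition character as its ASCII value in the edge weight follows the description above (character classes such as ^A would need their own encoding).

```python
# Sketch: write an NFA graph (nodes, edges, transition characters) to gexf.
import networkx as nx

def nfa_to_gexf(edges, path):
    """edges: iterable of (source_id, target_id, transition_char)."""
    g = nx.DiGraph()
    for source, target, ch in edges:
        g.add_node(source, label=str(source))
        g.add_node(target, label=str(target))
        g.add_edge(source, target, weight=ord(ch))   # ASCII code of the character
    nx.write_gexf(g, path)

# e.g. nfa_to_gexf([(0, 1, 'T'), (1, 2, 'R'), (2, 3, 'A')], "nfa.gexf")
```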
(3) Extracting all root subgraphs of the NFA: to explain the specific process of extracting root subgraphs after edge information is added, set d = 3, i.e. aggregate the nodes within the 3-hop neighborhood. Taking the NFA graph in Table 4 as an example, the WL algorithm extracts the 3-degree root subgraph of each node in the graph, where the number a in "a+b" denotes the a-th iteration round (i.e. information within the a-hop neighborhood has been aggregated) and the number b denotes the root subgraph label obtained by aggregation. Round 0 records the feature information of the nodes, i.e. the node classification result obtained from each node's out-edge information, where "1" denotes a node that accelerates state explosion, "2" a node that blocks state explosion, and "0" a common node. For example, node 3 is labeled "1" because its out-edge character set is ^A (all characters except A), which then appears 3 times in succession and shares the character T with the prefix, so the node accelerates state explosion; node 2 is labeled "2" because the intersection of its out-edge character A with the following character set ^A is empty, which hinders the generation of a state explosion. In later rounds, the first number in brackets denotes the node's root subgraph label from the previous round; the previous round's labels of all its out-neighbors are then appended, and the combined sequence is mapped to a new label to obtain the label of the root subgraph. Finally, the 3-degree root subgraphs of all nodes are obtained; the NFA in Table 4 is thus composed of nine 3-degree root subgraphs.
(4) Obtaining the embedded representation of the NFA graph: after the WL algorithm is run on each NFA graph, the set of all root subgraphs corresponding to that NFA graph is obtained, so each NFA graph is composed of its corresponding set of root subgraphs. All root subgraphs and the correspondence between root subgraphs and NFA graphs (which NFA graph each root subgraph belongs to) are input into the skip-gram model, and the embedded representation of each NFA graph is obtained by training, i.e. the high-dimensional graph structure data is represented by a low-dimensional vector of 1 × 1024 dimensions (the dimension is set manually).
(5) Classification prediction: after the NFA graphs are embedded into the low-dimensional vector space, a classification model can be used to classify the vector representations (i.e. the embedded representations of the NFA graphs obtained in the previous step). For example, with the SVM (support vector machine) classification method adopted by the invention, the embedded representation of an NFA graph is fed directly into the classification model, which yields the class of the regular expression corresponding to that NFA graph (i.e. whether it is a state explosion type regular expression). The basic idea of the SVM is to solve for a separating hyperplane that correctly divides the training data set with the largest geometric margin; the points in the two half-spaces separated by the hyperplane correspond to the two classification results, as shown in FIG. 3.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make modifications and improvements without departing from the principles of the invention, and various substitutions, alterations and modifications are possible without departing from its spirit and scope. The invention is not limited to the embodiments of this description and the disclosure of the drawings; its scope is defined by the claims.

Claims (4)

1. A method for identifying a state explosion type regular expression, comprising the following steps:
1) for a regular expression to be identified, generating the NFA graph corresponding to the regular expression according to the Thompson construction, and storing the information of the NFA graph in a gexf file, to obtain an NFA graph set corresponding to the regular expression; the information of the NFA graph includes node information and edge information of the NFA graph, wherein the node information includes the id value of each node, and the edge information includes the start node, the end node, and the transition character on the edge for each edge; according to its out-edge character set, each node is classified into one of three types as the node's feature information: nodes that accelerate state explosion, nodes that block state explosion, and common nodes; wherein a node that accelerates state explosion is defined as follows: if the out-edge character set of a node n overlaps the prefix of the regular expression and the out-edge character set appears k times in succession, node n is marked as a node that accelerates state explosion; a node that blocks state explosion is defined as follows: if there is a node n that accelerates state explosion and there is a node m before node n whose out-edge character set has an empty intersection with the out-edge character set of node n, node m is marked as a node that blocks state explosion; the remaining nodes, other than the nodes that accelerate state explosion and the nodes that block state explosion, are marked as common nodes;
2) for each NFA graph in the NFA graph set, adding the node feature information of the NFA graph into the WL algorithm, extracting all root subgraphs in the gexf file corresponding to the NFA graph with the WL algorithm fused with node classification, inputting the root subgraphs into a graph2vec model, and training to obtain an embedded representation of the NFA graph;
3) processing the embedded representation of the NFA graph with a classification model to judge whether the regular expression is a state explosion type regular expression.
2. The method of claim 1, wherein the classification model is an SVM classifier.
3. The method of claim 1, wherein the graph2vec model is trained as follows: all root subgraphs in the NFA graphs are treated as the vocabulary of a document; the vectorized representation of the document is obtained by maximizing $\sum_{j}\log \Pr(w_j \mid d_i)$, and the optimization objective is to maximize the probability of each graph's own root subgraphs appearing in it, i.e. to maximize $\sum_{n}\log \Pr\big(sg_n^{(d)} \mid \Phi(G)\big)$, where $d_i$ denotes the vectorized representation of the i-th document, $w_j$ denotes the embedded representation of the j-th word of the vocabulary appearing in the i-th document, $\Pr(w_j \mid d_i)$ denotes the conditional probability of the word $w_j$ occurring in document $d_i$, $\Phi(G)$ is the vectorized representation of the graph G whose embedded representation is sought, and $sg_n^{(d)}$ denotes the vectorized representation of the d-degree root subgraph rooted at node n in graph G.
4. A recognition system for state explosion type regular expressions, comprising an NFA graph conversion module, a root subgraph extraction module, an NFA graph embedded-representation generation module, and a classification model, wherein:
the NFA graph conversion module is used to generate the NFA graph corresponding to the regular expression according to the Thompson construction and store the information of the NFA graph in a gexf file, obtaining an NFA graph set corresponding to the regular expression; the information of the NFA graph includes node information and edge information of the NFA graph, wherein the node information includes the id value of each node, and the edge information includes the start node, the end node, and the transition character on the edge for each edge; according to its out-edge character set, each node is classified into one of three types as the node's feature information: nodes that accelerate state explosion, nodes that block state explosion, and common nodes; wherein a node that accelerates state explosion is defined as follows: if the out-edge character set of a node n overlaps the prefix of the regular expression and the out-edge character set appears k times in succession, node n is marked as a node that accelerates state explosion; a node that blocks state explosion is defined as follows: if there is a node n that accelerates state explosion and there is a node m before node n whose out-edge character set has an empty intersection with the out-edge character set of node n, node m is marked as a node that blocks state explosion; the remaining nodes, other than the nodes that accelerate state explosion and the nodes that block state explosion, are marked as common nodes;
the root subgraph extraction module is used, for each NFA graph in the NFA graph set, to add the node feature information of the NFA graph into the WL algorithm and to extract all root subgraphs in the gexf file corresponding to the NFA graph with the WL algorithm fused with node classification;
the NFA graph embedded-representation generation module is used to train on all root subgraphs of the NFA graph with a graph2vec model, obtaining the embedded representation of the NFA graph;
the classification model is used to judge, from the embedded representation of the NFA graph of the regular expression, whether the regular expression is a state explosion type regular expression.
CN202110784458.0A 2021-07-12 2021-07-12 Method and system for identifying state explosion type regular expression Active CN113627164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110784458.0A CN113627164B (en) 2021-07-12 2021-07-12 Method and system for identifying state explosion type regular expression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110784458.0A CN113627164B (en) 2021-07-12 2021-07-12 Method and system for identifying state explosion type regular expression

Publications (2)

Publication Number Publication Date
CN113627164A CN113627164A (en) 2021-11-09
CN113627164B true CN113627164B (en) 2024-03-01

Family

ID=78379500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110784458.0A Active CN113627164B (en) 2021-07-12 2021-07-12 Method and system for identifying state explosion type regular expression

Country Status (1)

Country Link
CN (1) CN113627164B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101827084A (en) * 2009-01-28 2010-09-08 丛林网络公司 Efficient application identification of network equipment
CN103259793A (en) * 2013-05-02 2013-08-21 东北大学 Method for inspecting deep packets based on suffix automaton regular engine structure
CN109800337A (en) * 2018-12-06 2019-05-24 成都网安科技发展有限公司 A kind of multi-mode canonical matching algorithm suitable for big alphabet

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10242125B2 (en) * 2013-12-05 2019-03-26 Entit Software Llc Regular expression matching
US10042654B2 (en) * 2014-06-10 2018-08-07 International Business Machines Corporation Computer-based distribution of large sets of regular expressions to a fixed number of state machine engines for products and services

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101827084A (en) * 2009-01-28 2010-09-08 丛林网络公司 Efficient application identification of network equipment
CN103259793A (en) * 2013-05-02 2013-08-21 东北大学 Method for inspecting deep packets based on suffix automaton regular engine structure
CN109800337A (en) * 2018-12-06 2019-05-24 成都网安科技发展有限公司 A kind of multi-mode canonical matching algorithm suitable for big alphabet

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A regular expression matching method addressing DFA state explosion; Wang Xiang; Lu Yuhai; Ma Wei; Liu Yanbing; Computer Engineering; Vol. 45, No. 04; 148-156 *

Also Published As

Publication number Publication date
CN113627164A (en) 2021-11-09

Similar Documents

Publication Publication Date Title
CN110059181B (en) Short text label method, system and device for large-scale classification system
US9558299B2 (en) Submatch extraction
CN113434858B (en) Malicious software family classification method based on disassembly code structure and semantic features
CN109033833B (en) Malicious code classification method based on multiple features and feature selection
US20220179955A1 (en) Mobile malicious code classification method based on feature selection and recording medium and device for performing the same
US20230334154A1 (en) Byte n-gram embedding model
CN115357904B (en) Multi-class vulnerability detection method based on program slicing and graph neural network
Khan et al. Malware classification framework using convolutional neural network
CN115361176A (en) SQL injection attack detection method based on FlexUDA model
Xu et al. DHA: Supervised deep learning to hash with an adaptive loss function
CN113627164B (en) Method and system for identifying state explosion type regular expression
CN111209373A (en) Sensitive text recognition method and device based on natural semantics
Machado et al. Improving face detection
CN112380535B (en) CBOW-based malicious code three-channel visual identification method
CN113869398A (en) Unbalanced text classification method, device, equipment and storage medium
CN114139153A (en) Graph representation learning-based malware interpretability classification method
CN113626574A (en) Information query method, system, device and medium
CN113657443A (en) Online Internet of things equipment identification method based on SOINN network
CN112733144A (en) Malicious program intelligent detection method based on deep learning technology
CN112906588A (en) Riot and terrorist picture safety detection system based on deep learning
CN111581640A (en) Malicious software detection method, device and equipment and storage medium
Ssebulime Email classification using machine learning techniques
Ji et al. Multi-Scale Defense of Adversarial Images
Minařík et al. Recognition of randomly deformed objects
US20230229740A1 (en) Multiclass classification apparatus and method robust to imbalanced data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant