CN113627164A

CN113627164A - Method and system for identifying state explosion type regular expression

Info

Publication number: CN113627164A
Application number: CN202110784458.0A
Authority: CN
Inventors: 卢毓海; 王晓琳; 张春燕; 刘燕兵; 谭建龙; 郭莉
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2021-07-12
Filing date: 2021-07-12
Publication date: 2021-11-09
Anticipated expiration: 2041-07-12
Also published as: CN113627164B

Abstract

The invention discloses a method and a system for identifying state explosion type regular expressions. The method comprises the following steps: 1) generating the NFA (nondeterministic finite automaton) graph corresponding to each regular expression to be identified, to obtain the set of NFA graphs corresponding to the regular expressions; 2) for each NFA graph in the set, extracting all root subgraphs of the NFA graph, inputting them into a graph2vec model, and training to obtain an embedded representation of the NFA graph; 3) processing the embedded representation of the NFA graph with a classification model to determine whether the regular expression is a state explosion type regular expression. The method can process regular expressions in batches efficiently and quickly, and meets the requirements of an online system for high processing performance and low space consumption.

Description

Method and system for identifying state explosion type regular expression

Technical Field

The invention relates to a method and a system for identifying a state explosion type regular expression, and belongs to the technical field of computer software.

Background

Regular expression matching is a key component of many applications such as network filtering, for example Deep Packet Inspection (DPI), where it helps secure network communications and detect malicious traffic. To perform regular expression matching, the regular expression must first be converted into a finite automaton (FA). A finite automaton is a state machine that recognizes the same language as the regular expression. Depending on whether the next state transition is uniquely determined, FAs are classified as nondeterministic finite automata (NFA) or deterministic finite automata (DFA). NFAs and DFAs have equivalent expressive power, but because each DFA state corresponds to a set of NFA states, converting an NFA to a DFA may dramatically increase the number of states, a phenomenon known as state explosion. Table 1 compares the space complexity and matching time complexity of NFA and DFA under different encoding strategies.

Table 1. Space complexity versus matching time complexity of NFA and DFA under different encoding strategies

DFAs are widely used in DPI applications because of their efficient matching performance, but DFA state explosion poses a significant challenge to their practical application. The existing technique for determining whether a regular expression causes DFA state explosion simply sets a threshold on the number of DFA states: if the number of DFA states exceeds the threshold, the regular expression is judged to be of the state explosion type; otherwise it is not. Specifically, the threshold is applied during the generation of the DFA from the regular expression: the regular expression is first parsed into a parse tree, the parse tree is converted into an NFA using the Thompson or Glushkov construction method, and finally the NFA is converted into a DFA using the subset construction method. If the number of states of the generated DFA exceeds the threshold, the DFA is judged to have undergone state explosion and the corresponding regular expression is of the state explosion type; otherwise it is of the non-state-explosion type.
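The threshold check described above can be illustrated with a minimal Python sketch. It is not the implementation of any particular tool: the NFA encoding (a dictionary from (state, symbol) to a set of states), the ε-free toy automaton built by the hypothetical helper `build_nfa`, and all function names are assumptions made for illustration. The subset construction stops as soon as the number of DFA states exceeds the threshold.

```python
from collections import deque


def count_dfa_states(nfa, start, alphabet, threshold):
    """Subset construction that aborts once more than `threshold` DFA states exist.

    Returns (number of DFA states found so far, whether the threshold was exceeded).
    The NFA is assumed to be epsilon-free; the empty (dead) subset is not counted.
    """
    start_set = frozenset([start])
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # Union of the NFA moves of every state in the current subset.
            nxt = frozenset(t for s in current for t in nfa.get((s, symbol), ()))
            if nxt and nxt not in seen:
                seen.add(nxt)
                if len(seen) > threshold:
                    return len(seen), True
                queue.append(nxt)
    return len(seen), False


def build_nfa(n):
    """Toy NFA for (a|b)*a(a|b){n}: state 0 loops, states 1..n+1 form a chain."""
    nfa = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, n + 1):
        nfa[(i, "a")] = {i + 1}
        nfa[(i, "b")] = {i + 1}
    return nfa


print(count_dfa_states(build_nfa(3), 0, "ab", 1000))  # → (16, False)
print(count_dfa_states(build_nfa(10), 0, "ab", 100))  # → (101, True)
```

For this toy language the DFA needs 2^(n+1) states, so n = 3 stays below the threshold while n = 10 is cut off as soon as the 101st state appears. The sketch also shows the cost of the approach: the states have to be generated and stored before the decision can be made.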

The existing technical scheme identifies state explosion type regular expressions mainly by setting a threshold, and has the following defects:

1. Low practicality: to determine whether a single regular expression is of the state explosion type, the complete process of generating a DFA from the regular expression has to be executed. The operation is complex, the algorithm is slow, and it is difficult to carry out.

2. High space complexity: if the regular expression is of the state explosion type, a large number of DFA states are generated while converting the NFA to a DFA, and a large amount of cache space is needed to record them, so the space complexity is extremely high.

3. Specific state explosion structures cannot be identified: for some regular expressions the number of DFA states does not reach the threshold, so this technique does not identify them as state explosion type regular expressions, although they contain a specific state explosion structure and should be classified as state explosion type.

Disclosure of Invention

In view of the technical problems in the prior art, the invention aims to provide a method for identifying state explosion type regular expressions that can process regular expressions in batches efficiently and quickly and meets the requirements of an online system for high processing performance and low space consumption. The main idea of the invention is to use a graph neural network (GNN) model to automatically learn the structural features of NFA graphs (generated from the regular expressions by the Thompson construction method), embed the high-dimensional graph representation into a low-dimensional vector space, and then use a classification model to classify the vectorized graphs, i.e., to decide whether each graph is the NFA of a state explosion type regular expression. The details are described below.

The technical scheme of the invention is as follows:

A method for identifying a state explosion type regular expression comprises the following steps:

1) generating the NFA (nondeterministic finite automaton) graph corresponding to each regular expression to be identified, to obtain the set of NFA graphs corresponding to the regular expressions;

2) for each NFA graph in the set, extracting all root subgraphs of the NFA graph, inputting them into a graph2vec model, and training to obtain an embedded representation of the NFA graph;

3) processing the embedded representation of the NFA graph with a classification model to determine whether the regular expression is a state explosion type regular expression.

Further, in step 1), the NFA graph corresponding to the regular expression is generated by the Thompson construction method, and the information of the NFA graph is stored in a gexf file. The information of the NFA graph includes node information and edge information: the node information includes the id value of each node, and the edge information includes the start node and end node of each edge and the transition character on the edge.

Further, according to the outgoing edge character set of each node, the nodes are divided into three classes, which serve as the feature information of the nodes: nodes that accelerate state explosion, nodes that hinder state explosion, and ordinary nodes. A node n is marked as accelerating state explosion if its outgoing edge character set overlaps with the prefix of the regular expression and that character set occurs k times consecutively. A node m is marked as hindering state explosion if there exists an accelerating node n, node m precedes node n, and the intersection of the outgoing edge character set of node m with that of node n is empty. All remaining nodes are marked as ordinary nodes.

Further, the feature information of the nodes is added to the WL algorithm, and all root subgraphs in the gexf file are then extracted using the WL algorithm fused with the node classification.

Further, the classification model is an SVM classifier.

Further, the graph2vec model is trained as follows: all root subgraphs in the NFA graphs are treated as the vocabulary of documents, and a vectorized representation of each document is obtained by maximizing

$$\sum_{j} \log \Pr(w_j \mid d_i)$$

The optimization goal is to maximize the probability that each root subgraph appears in the graph it belongs to, i.e., to maximize

$$\sum_{n \in G} \log \Pr\left(sg_n^{(d)} \mid \Phi(G)\right)$$

where $d_i$ is the vectorized representation of the i-th document, $w_j$ is the embedded representation of the j-th vocabulary word appearing in the i-th document, $\Pr(w_j \mid d_i)$ is the conditional probability that word $w_j$ occurs in document $d_i$, $\Phi(G)$ is the vectorized representation of the graph G whose embedding is sought, and $sg_n^{(d)}$ is the vectorized representation of the d-degree root subgraph rooted at node n in graph G.

A system for identifying a state explosion type regular expression, characterized by comprising an NFA graph conversion module, a root subgraph extraction module, an NFA graph embedded representation generation module and a classification model, wherein:

the NFA graph conversion module is used for generating the NFA graph corresponding to each regular expression, to obtain the set of NFA graphs corresponding to the regular expressions;

the root subgraph extraction module is used for extracting all root subgraphs of each NFA graph;

the NFA graph embedded representation generation module is used for training a graph2vec model on all root subgraphs of the NFA graph to obtain the embedded representation of the NFA graph;

the classification model is used for determining, from the embedded representation of the NFA graph of a regular expression, whether that regular expression is a state explosion type regular expression.

The invention first selects the graph2vec model as the base model and, taking the characteristics of NFA graphs into account, adds the edge information of the NFA to optimize the input of the model, thereby achieving more accurate classification. The key technical solutions of the invention are introduced below:

1. graph2vec model introduction

The selected base model is graph2vec, a graph neural network framework for unsupervised representation learning. It learns embedded representations of undirected, directed and weighted graphs, and these representations can serve as input data for downstream tasks such as graph classification. The graph2vec model is designed as follows:

(1) Extracting root subgraphs: first, the WL algorithm proposed by Shervashidze (shown as Algorithm 1) is used to extract the d-degree root subgraph of each node in an NFA graph, and each distinct root subgraph is assigned a unique label. Each NFA graph can thus be regarded as composed of a set of root subgraphs; if each graph is regarded as a document, the root subgraphs of the graph correspond to the words that make up the document, as shown in FIG. 1.

(2) Training: the root subgraphs of all graphs obtained by the WL algorithm are regarded as the vocabulary of the documents, and the idea of the doc2vec model is applied: the more words two documents share, the closer their representations in the low-dimensional vector space should be. The training goal of the model is to maximize the conditional probability of words that appear in a document and minimize the conditional probability of words that do not appear in it, i.e., to obtain the vectorized representation of each document by maximizing equation (1):

$$\sum_{j} \log \Pr(w_j \mid d_i) \tag{1}$$

where $d_i$ is the vectorized representation of the i-th document, $w_j$ is the embedded representation of the j-th vocabulary word appearing in the i-th document, and $\Pr(w_j \mid d_i)$ is the conditional probability that word $w_j$ occurs in document $d_i$.

Likewise, if two graphs are composed of many similar root subgraphs, the graph2vec model expects their low-dimensional embedded representations to be close. Its optimization goal is therefore to maximize the probability that each root subgraph appears in the graph it belongs to, as in equation (2):

$$\sum_{n \in G} \log \Pr\left(sg_n^{(d)} \mid \Phi(G)\right) \tag{2}$$

where $\Phi(G)$ is the vectorized representation of the graph G (an NFA graph generated from a regular expression) whose embedding is sought, and $sg_n^{(d)}$ is the vectorized representation of the d-degree root subgraph rooted at node n in graph G.
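The text does not state the functional form of the conditional probability. In the skip-gram family of models, to which doc2vec and graph2vec belong, it is commonly modeled as a softmax over the vocabulary. The following is a sketch under that assumption; the symbols $V$ (the vocabulary of all root subgraphs) and $\vec{v}_{sg}$ (the embedding of root subgraph $sg$) are introduced here for illustration only:

```latex
\Pr\left(sg_n^{(d)} \mid \Phi(G)\right)
  = \frac{\exp\left(\Phi(G) \cdot \vec{v}_{sg_n^{(d)}}\right)}
         {\sum_{sg \in V} \exp\left(\Phi(G) \cdot \vec{v}_{sg}\right)}
```

Because the denominator sums over the whole vocabulary, evaluating it exactly is expensive, which is the usual motivation for approximating it with negative sampling.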

During model training, equation (2) is maximized using the idea of the skipgram model together with negative sampling, and the final graph embeddings are obtained with the stochastic gradient descent optimization algorithm.
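The training step can be sketched in pure Python as follows. This is a simplified illustration rather than the graph2vec reference implementation: the function name `train_graph_embeddings`, the hyperparameters, the toy corpus, and the choice to draw negative samples uniformly from the root subgraphs absent from a graph are all assumptions. Each graph is a "document" whose "words" are its root subgraph labels.

```python
import math
import random


def train_graph_embeddings(docs, dim=8, epochs=50, lr=0.05, negatives=2, seed=0):
    """Skip-gram style training with negative sampling and plain SGD.

    docs maps a graph id to the list of root subgraph labels found in it.
    Returns a dict mapping each graph id to its embedding (list of floats).
    """
    rng = random.Random(seed)
    vocab = sorted({w for words in docs.values() for w in words})
    graph_vec = {g: [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for g in docs}
    word_vec = {w: [0.0] * dim for w in vocab}

    def sgd_step(g, w, target):
        # Logistic loss on the dot product; target is 1 for a root subgraph
        # present in the graph and 0 for a negative sample.
        gv, wv = graph_vec[g], word_vec[w]
        score = sum(a * b for a, b in zip(gv, wv))
        grad = 1.0 / (1.0 + math.exp(-score)) - target
        for i in range(dim):
            gi, wi = gv[i], wv[i]
            gv[i] -= lr * grad * wi
            wv[i] -= lr * grad * gi

    for _ in range(epochs):
        for g, words in docs.items():
            absent = [v for v in vocab if v not in words]
            for w in words:
                sgd_step(g, w, 1.0)
                for neg in rng.sample(absent, min(negatives, len(absent))):
                    sgd_step(g, neg, 0.0)
    return graph_vec


# Hypothetical corpus: g1 and g2 share two root subgraphs, g3 shares none.
corpus = {"g1": ["a", "b", "c"], "g2": ["a", "b", "d"], "g3": ["x", "y", "z"]}
embeddings = train_graph_embeddings(corpus, dim=4, epochs=20)
```

Positive updates pull a graph vector toward the vectors of the root subgraphs it contains, and negative updates push it away from sampled root subgraphs it does not contain, which mirrors the maximize/minimize description of the training goal.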

(3) Use: once the embedded representation of each graph has been obtained, it can be used for downstream classification or clustering tasks; for example, the graphs can be classified into two classes with an SVM algorithm or clustered with the K-means method.

2. Optimizing model data input

The invention predicts the output label of NFA graph samples. In the set of NFA graphs generated from the regular expressions to be processed, each NFA graph has not only a node and edge structure; the transition information on its edges also carries much of the state explosion information. Processing of edge information is therefore added on top of the graph2vec model. The specific design idea is to divide the nodes into three classes, which serve as the feature information of the nodes, by analyzing the outgoing edge character set of each node: nodes that accelerate state explosion, nodes that hinder state explosion, and ordinary nodes.

(1) Nodes that accelerate state explosion: if the outgoing edge character set of node n overlaps with the prefix of the regular expression and that character set occurs k times consecutively (k > 1), node n makes the number of DFA states of the regular expression grow rapidly, and node n is marked as a node that accelerates state explosion.

(2) Nodes that hinder state explosion: if there is an accelerating node n as described above, a node m precedes node n, and the intersection of the outgoing edge character set of node m with that of node n is empty, node m slows the growth of the number of DFA states of the regular expression, and node m is marked as a node that hinders state explosion.

(3) Ordinary nodes: all nodes other than the above two types are marked as ordinary nodes.
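The three rules can be sketched in Python for the simple case of a linear NFA, using the expression TRA[^A]{3}HL from the detailed description. Several choices here are assumptions, since the text leaves them open: the toy alphabet of uppercase letters, taking the "prefix" to be the first character of the expression, marking only the first node of a run of k identical character sets as accelerating, and letting "before" mean any earlier node. The function name `classify_nodes` is hypothetical.

```python
import string

ALPHABET = set(string.ascii_uppercase)  # toy alphabet (assumption)
ORDINARY, ACCELERATING, HINDERING = 0, 1, 2


def classify_nodes(out_sets, prefix_chars, k_min=2):
    """Label each node of a linear NFA from its outgoing edge character set.

    out_sets[i] is the outgoing edge character set of node i (empty for the
    final node); prefix_chars is the character set of the expression prefix.
    """
    labels = [ORDINARY] * len(out_sets)
    i = 0
    while i < len(out_sets):
        # Find the run of consecutive nodes sharing the same outgoing set.
        j = i
        while j + 1 < len(out_sets) and out_sets[j + 1] == out_sets[i]:
            j += 1
        run_len = j - i + 1
        if run_len >= k_min and out_sets[i] & prefix_chars:
            labels[i] = ACCELERATING
            # Earlier nodes whose outgoing set is disjoint from the run's set.
            for m in range(i):
                if labels[m] == ORDINARY and not (out_sets[m] & out_sets[i]):
                    labels[m] = HINDERING
        i = j + 1
    return labels


not_a = ALPHABET - {"A"}
# Nodes 0..8 of a linear NFA for TRA[^A]{3}HL; node 8 is final (no outgoing edge).
out_sets = [{"T"}, {"R"}, {"A"}, not_a, not_a, not_a, {"H"}, {"L"}, set()]
print(classify_nodes(out_sets, prefix_chars={"T"}))  # → [0, 0, 2, 1, 0, 0, 0, 0, 0]
```

Under these assumptions node 3 is marked as accelerating and node 2 as hindering, which agrees with the labels given for this expression in the detailed description.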

After the nodes of the NFA have been classified into these three classes, the class of each node is added to the WL algorithm as feature information, so that the resulting root subgraphs contain not only the structural information of the graph but also the information on its edges; the graph2vec model is then trained as before to obtain the embedded representation of the NFA.

Compared with the prior art, the invention has the following positive effects:

An NFA is a graph composed of nodes and edges, and certain special structures in it cause the generated DFA to suffer state explosion. Therefore, to classify regular expressions, the invention converts them into graph structures and handles the graph task end to end with a graph neural network (GNN), extracting features to obtain a low-dimensional embedded representation of each NFA graph, so that graphs with similar structures are also close in the vector space and can be classified by their structural features.

In implementing the model, node features are added according to edge information, and each complete NFA graph is decomposed into a number of subgraphs, which further helps graphs with similar substructures receive the same classification result. These two key points are described below.

Adding edge information:

The NFA graph corresponding to each regular expression is generated by the Thompson construction method, and the information on the edges of the graph determines whether the generated DFA suffers state explosion. Even if two NFA graphs are isomorphic, they may have different labels if the transition information on corresponding edges differs. The edge information therefore has to be added to the model, and the way it is added matters: it must fully reflect the characteristics of state explosion type regular expressions. Based on a survey of the relevant literature and experimental verification, the model divides the nodes into three classes by analyzing their edge information and adds the classes to the WL algorithm as node feature information, so that the extracted root subgraphs contain the edge-induced state explosion features.

Extracting root subgraphs:

If two NFA graphs are composed of many identical subgraphs (including the information on their edges), they have the same label with high probability, and experiments verify that the NFAs of most state explosion type regular expressions have similar substructures. Therefore, during model training, the WL algorithm is used to extract the root subgraphs of all graphs, and each graph is regarded as composed of a set of root subgraphs. The resulting graph embedding reflects the subgraphs the graph contains, so NFAs with similar subgraphs are also closer in the embedding space.

Specifically, for one regular expression (from a rule library such as snort or l7), the process of determining whether it produces a state explosion DFA is as follows: first, the NFA graph corresponding to the regular expression is generated by the Thompson construction method and its information is stored in a gexf file; all root subgraphs of the gexf graph are then extracted using the WL algorithm fused with the node classification; the graph2vec model is then trained to obtain the embedded representation of the graph; finally, the embedded representation is processed by an SVM classifier to obtain the final label: 1 (explosive) or -1 (non-explosive).

The hardware configuration used in the experiments is shown in Table 2:

Table 2. Hardware configuration of the experiment

Operating system	CentOS Linux release 7.5.1804
Memory	125G
CPU	Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz

Experiment design: 3520 regular expressions were selected from the rule libraries clav_regex_strings_2015, l7_rules_2015 and snort_regex_rules_2015 and converted into the corresponding NFA graphs by the Thompson construction method, and the model was trained to obtain the embedded representation of each graph. Because the SVM is a supervised model, each regular expression was labeled with the threshold method for the SVM binary classification; of the 3520 regular expressions, 1570 were labeled -1 (non-explosive) and 1720 were labeled 1 (explosive). During training, 70% (3168) of the NFA embedded representations were randomly selected as the training set and the remaining 30% (352) as the test set. With d set to 3, 5 and 8 respectively, the experimental results are shown in Table 3:

TABLE 3 comparison of the results

The experimental results show that the highest classification accuracy, 98%, is achieved when d is set to 5, i.e., when 5-hop neighbor information is aggregated.

Drawings

FIG. 1 is a comparison of the concept of the Graph2vec model and the concept of the Doc2vec model;

(a) a Doc2vec model idea, (b) a Graph2vec model idea;

FIG. 2 is a flow chart of the method of the present invention;

FIG. 3 is an example of SVM binary classification.

Detailed Description

Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

The original input of the invention is a set of regular expressions, for example the regular expression "TRA[^A]{3}HL" shown in Table 4. The process of predicting whether a regular expression is of the state explosion type, shown in FIG. 2, is divided into the following stages:

(1) First, the NFA graph corresponding to each regular expression is generated by the Thompson construction method; Table 4 shows the NFA graph corresponding to the regular expression "TRA[^A]{3}HL".

Table 4. Example of the execution of the WL algorithm

(2) Converting the NFA into standard graph format data: after each regular expression has been converted to its NFA graph by the Thompson construction method, a gexf format file is used to record the NFA graph information. For example, the gexf file storing the NFA graph structure of the regular expression "TRA[^A]{3}HL" in Table 4 contains the node information and edge information of the NFA graph, where the node information includes the id value of each node, and the edge information includes the start node (source) and end node (target) of each edge and the transition character on the edge (i.e., the edge character; the weight is represented by its ASCII code value).
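A gexf file of this kind can be sketched with the Python standard library alone. The layout below is an assumption rather than the file format actually used in the experiments: a linear nine-node NFA for TRA[^A]{3}HL, the GEXF 1.2 draft namespace, a `label` attribute carrying the edge character set, and a `weight` attribute (the ASCII code) only on single-character edges, because the text does not say how a set such as [^A] is encoded. The function name `nfa_to_gexf` is hypothetical.

```python
import xml.etree.ElementTree as ET


def nfa_to_gexf(num_nodes, edges):
    """Serialize an NFA graph to a GEXF string.

    edges is a list of (source, target, chars) tuples, where chars is the
    transition character or a textual character set such as "[^A]".
    """
    gexf = ET.Element("gexf", {"xmlns": "http://www.gexf.net/1.2draft", "version": "1.2"})
    graph = ET.SubElement(gexf, "graph", {"defaultedgetype": "directed"})
    nodes_el = ET.SubElement(graph, "nodes")
    for n in range(num_nodes):
        ET.SubElement(nodes_el, "node", {"id": str(n), "label": str(n)})
    edges_el = ET.SubElement(graph, "edges")
    for i, (src, dst, chars) in enumerate(edges):
        attrs = {"id": str(i), "source": str(src), "target": str(dst), "label": chars}
        if len(chars) == 1:
            # Single transition character: store its ASCII code as the weight.
            attrs["weight"] = str(ord(chars))
        ET.SubElement(edges_el, "edge", attrs)
    return ET.tostring(gexf, encoding="unicode")


# Linear NFA for TRA[^A]{3}HL: node i --chars--> node i + 1.
transitions = ["T", "R", "A", "[^A]", "[^A]", "[^A]", "H", "L"]
nfa_edges = [(i, i + 1, chars) for i, chars in enumerate(transitions)]
gexf_text = nfa_to_gexf(9, nfa_edges)
```

The resulting string can be written to a `.gexf` file and read back by graph tools that support the format.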

(3) Extracting all root subgraphs of the NFA: To illustrate root subgraph extraction after edge information has been added, set d = 3, i.e., aggregate the nodes within a 3-hop neighborhood. Taking the NFA graph in Table 4 as an example, the WL algorithm extracts the 3-degree root subgraph of each node in the graph. In a label "a+b", a denotes the a-th iteration round (and equally that information within an a-hop neighborhood has been aggregated), and b denotes the root subgraph label obtained by the aggregation. Round 0 records the feature information of each node, i.e., the node class derived from its edge information: "1" denotes a node that accelerates state explosion, "2" a node that hinders state explosion, and "0" an ordinary node. For example, node 3 is labeled "1" because its outgoing edge character set is [^A] (all characters except A), this set then occurs 3 times consecutively, and it overlaps with the prefix T in the character T, so the node accelerates state explosion. Node 2 is labeled "2" because the intersection of its outgoing edge character A with the following character set [^A] is empty, which hinders state explosion. In each later round, the first number in parentheses is the node's root subgraph label from the previous round, followed by the previous-round root subgraph labels of all its out-neighbors; the combined sequence is mapped to a new label, which is the node's root subgraph label for the current round. Finally the 3-degree root subgraphs of all nodes are obtained, so the NFA in Table 4 is composed of nine 3-degree root subgraphs.
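The relabeling procedure just described can be sketched as follows for a directed graph with initial node classes. This is an illustrative sketch, not the patent's Algorithm 1: the function name `wl_rooted_subgraphs`, the linear nine-node NFA, and the way signatures are compressed to integers (renumbered from 0 in every round) are assumptions, so the concrete label numbers need not match Table 4.

```python
def wl_rooted_subgraphs(out_edges, init_labels, d):
    """Weisfeiler-Lehman relabeling over out-neighbors for rounds 0..d.

    out_edges maps a node to the list of its out-neighbors; init_labels maps
    each node to its class (0 ordinary, 1 accelerating, 2 hindering).
    Returns a dict mapping each node to its words "round+label", one per round.
    """
    labels = dict(init_labels)
    words = {n: [f"0+{labels[n]}"] for n in labels}
    for rnd in range(1, d + 1):
        # Signature: own previous label, then sorted previous labels of out-neighbors.
        signatures = {
            n: (labels[n], tuple(sorted(labels[m] for m in out_edges.get(n, ()))))
            for n in labels
        }
        # Map every distinct signature to a compact new label.
        table = {sig: idx for idx, sig in enumerate(sorted(set(signatures.values())))}
        labels = {n: table[signatures[n]] for n in labels}
        for n in labels:
            words[n].append(f"{rnd}+{labels[n]}")
    return words


# Linear NFA for TRA[^A]{3}HL with node 3 accelerating and node 2 hindering.
chain = {i: [i + 1] for i in range(8)}
classes = {i: 0 for i in range(9)}
classes[3] = 1
classes[2] = 2
subgraph_words = wl_rooted_subgraphs(chain, classes, d=3)
```

Each of the nine nodes ends up with one word per round, and the round-3 word identifies its 3-degree root subgraph; these words form the "document" that is fed to the embedding model.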

(4) Obtaining the embedded representation of the NFA graph: after the WL algorithm has been executed on each NFA graph, the set of all root subgraphs of that NFA graph is available, so each NFA graph is regarded as composed of its set of root subgraphs. The correspondence between the root subgraphs and the NFA graphs (i.e., which NFA graph each root subgraph belongs to) is input into the existing skipgram model, and training yields the embedded representation of each NFA graph; that is, the high-dimensional graph structure data is represented by a low-dimensional vector of dimension 1 × 1024 (the dimension is set manually).

(5) Classification prediction: after the NFA graphs have been embedded into the low-dimensional vector space, the vectorized NFAs (i.e., the embedded representations of the NFA graphs obtained in the previous step) can be classified with a classification model, for example the SVM (support vector machine) binary classification method adopted in the invention. The embedded representation of an NFA graph is input directly into the classification model to obtain the class of the corresponding regular expression (i.e., whether it is a state explosion type regular expression). The basic idea of the SVM is to find the separating hyperplane that correctly divides the training data set and has the largest geometric margin; the points in the two half-spaces separated by the hyperplane correspond to the two classification results, as shown in FIG. 3.
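The maximum-margin idea can be illustrated with a small linear SVM trained by batch subgradient descent on the hinge loss. This is a stand-in for illustration, not the classifier used in the experiments: the text does not state the kernel or solver, and the function names, hyperparameters, and the two-dimensional toy "embeddings" are assumptions.

```python
def train_linear_svm(xs, ys, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * |w|^2 + mean hinge loss by batch subgradient descent.

    xs is a list of feature vectors, ys the labels (+1 explosive, -1 non-explosive).
    Returns the weight vector w and the bias b of the separating hyperplane.
    """
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # point violates the margin: hinge loss is active
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy 2-D stand-ins for graph embeddings.
train_x = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
train_y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(train_x, train_y)
print(predict(w, b, [3, 3]), predict(w, b, [-3, -3]))  # → 1 -1
```

In practice the input vectors would be the graph embeddings from the previous stage and the labels would come from the threshold method used to annotate the training set.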

The foregoing is merely a preferred embodiment of the present invention, and it should be understood that various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention. The present invention should not be limited to the disclosure of the embodiments and drawings in the specification, and the scope of the present invention is defined by the scope of the claims.

Claims

1. A method for identifying a state explosion type regular expression comprises the following steps:

2. The method according to claim 1, wherein in step 1), the NFA graph corresponding to the regular expression is generated by the Thompson construction method, and the information of the NFA graph is stored in a gexf file; the information of the NFA graph includes node information and edge information, wherein the node information includes the id value of each node, and the edge information includes the start node and end node of each edge and the transition character on the edge.

3. The method according to claim 2, characterized in that, according to the outgoing edge character set of each node, the nodes are divided into three classes, which serve as the feature information of the nodes: nodes that accelerate state explosion, nodes that hinder state explosion, and ordinary nodes; wherein a node n is marked as accelerating state explosion if its outgoing edge character set overlaps with the prefix of the regular expression and that character set occurs k times consecutively; a node m is marked as hindering state explosion if there exists an accelerating node n, node m precedes node n, and the intersection of the outgoing edge character set of node m with that of node n is empty; and all remaining nodes are marked as ordinary nodes.

4. The method of claim 3, wherein the feature information of the nodes is added to the WL algorithm, and all root subgraphs in the gexf file are then extracted using the WL algorithm fused with the node classification.

5. The method of claim 1, wherein the classification model is an SVM classifier.

6. The method of claim 1, wherein the graph2vec model is trained as follows: all root subgraphs in the NFA graphs are treated as the vocabulary of documents, and a vectorized representation of each document is obtained by maximizing

$$\sum_{j} \log \Pr(w_j \mid d_i)$$

and the optimization goal is to maximize the probability that each root subgraph appears in the graph it belongs to, i.e., to maximize

$$\sum_{n \in G} \log \Pr\left(sg_n^{(d)} \mid \Phi(G)\right)$$

7. A system for identifying a state explosion type regular expression, characterized by comprising an NFA graph conversion module, a root subgraph extraction module, an NFA graph embedded representation generation module and a classification model; wherein

8. The system of claim 7, wherein the NFA graph conversion module generates the NFA graph corresponding to the regular expression by the Thompson construction method and stores the information of the NFA graph in a gexf file; the information of the NFA graph includes node information and edge information, wherein the node information includes the id value of each node, and the edge information includes the start node and end node of each edge and the transition character on the edge.

9. The system of claim 8, wherein, according to the outgoing edge character set of each node, the nodes are divided into three classes, which serve as the feature information of the nodes: nodes that accelerate state explosion, nodes that hinder state explosion, and ordinary nodes; wherein a node n is marked as accelerating state explosion if its outgoing edge character set overlaps with the prefix of the regular expression and that character set occurs k times consecutively; a node m is marked as hindering state explosion if there exists an accelerating node n, node m precedes node n, and the intersection of the outgoing edge character set of node m with that of node n is empty; and all remaining nodes are marked as ordinary nodes.