CN113627164A - Method and system for identifying state explosion type regular expression - Google Patents

Info

Publication number
CN113627164A
Authority
CN
China
Prior art keywords
node
graph
nodes
nfa
regular expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110784458.0A
Other languages
Chinese (zh)
Other versions
CN113627164B (en)
Inventor
卢毓海
王晓琳
张春燕
刘燕兵
谭建龙
郭莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202110784458.0A priority Critical patent/CN113627164B/en
Publication of CN113627164A publication Critical patent/CN113627164A/en
Application granted granted Critical
Publication of CN113627164B publication Critical patent/CN113627164B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for identifying state-explosion-type regular expressions. The method comprises the following steps: 1) generating the NFA (nondeterministic finite automaton) graph corresponding to each regular expression to be identified, to obtain the set of NFA graphs corresponding to the regular expressions; 2) for each NFA graph in the set, extracting all rooted subgraphs of the NFA graph, inputting them into a graph2vec model, and training the model to obtain an embedded representation of the NFA graph; 3) processing the embedded representation of the NFA graph with a classification model to determine whether the regular expression is a state-explosion-type regular expression. The method can process regular expressions in batches efficiently and quickly, and meets the requirements of online systems for high processing performance and low space consumption.

Description

Method and system for identifying state explosion type regular expression
Technical Field
The invention relates to a method and a system for identifying a state explosion type regular expression, and belongs to the technical field of computer software.
Background
Regular expression matching is a key component of many applications such as network filtering, for example Deep Packet Inspection (DPI), which enhances the security of network communications by detecting malicious traffic. To perform matching, a regular expression is first converted into a finite automaton (FA). A finite automaton is a state machine that recognizes the same language as the regular expression; according to whether the next state transition is uniquely determined, FAs are classified into nondeterministic finite automata (NFA) and deterministic finite automata (DFA). NFAs and DFAs have equivalent expressive power, but because each DFA state corresponds to a set of NFA states, converting an NFA into a DFA may cause a dramatic increase in the number of states, a phenomenon known as state explosion. Table 1 compares the space complexity and the matching time complexity of NFA and DFA under different encoding strategies.
Table 1. Space complexity and matching time complexity of NFA and DFA under different encoding strategies
Figure BDA0003158612680000011
DFAs are widely used in DPI applications because of their efficient matching performance, but DFA state explosion poses a significant challenge to their practical use. The existing technique for identifying whether a regular expression causes DFA state explosion simply sets a threshold on the number of DFA states: if the number exceeds the threshold, the regular expression is judged to be of the state explosion type; otherwise it is not. Specifically, the threshold is applied while the DFA is generated from the regular expression: the regular expression is first parsed into a parse tree, the parse tree is converted into an NFA using the Thompson or Glushkov construction, and finally the NFA is converted into a DFA using the subset construction. If the number of states of the generated DFA exceeds the threshold, the DFA is judged to have exploded and the corresponding regular expression is of the state explosion type; otherwise, it is of the non-state-explosion type.
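The threshold check described above can be sketched as follows. This is only a minimal illustration, assuming an ε-free NFA given as a transition dictionary (an NFA built by the Thompson construction would additionally need ε-closure); the names `dfa_state_count` and `build_nfa` and the test pattern `(a|b)*a(a|b){k}` are illustrative choices and do not come from the patent.

```python
from collections import deque

def dfa_state_count(nfa, start, alphabet, threshold):
    """Subset construction that stops once more than `threshold` DFA states exist.

    nfa maps (state, char) -> set of next states (assumed epsilon-free).
    Returns (number_of_dfa_states_found, exceeded_threshold).
    """
    start_set = frozenset([start])
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for ch in alphabet:
            # A DFA state is the set of NFA states reachable on `ch`.
            nxt = frozenset(t for s in current for t in nfa.get((s, ch), ()))
            if nxt and nxt not in seen:
                seen.add(nxt)
                if len(seen) > threshold:
                    return len(seen), True
                queue.append(nxt)
    return len(seen), False

def build_nfa(k):
    """Hypothetical NFA for (a|b)*a(a|b){k}; state 0 loops on every character."""
    nfa = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, k + 1):
        nfa[(i, "a")] = {i + 1}
        nfa[(i, "b")] = {i + 1}
    return nfa

print(dfa_state_count(build_nfa(3), 0, "ab", threshold=100))
print(dfa_state_count(build_nfa(10), 0, "ab", threshold=1000))
```

The second call stops early because the number of subsets grows exponentially in k, which illustrates why the threshold method must perform most of the expensive construction before it can give an answer.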
The existing technical scheme judges state-explosion-type regular expressions mainly by setting a threshold, and it has the following defects:
1. Low practicality: to judge whether a regular expression is of the state explosion type, the complete process of generating the DFA from the regular expression must be executed; the operation is complex, the algorithm is slow, and it is difficult to carry out.
2. High space complexity: if the regular expression is of the state explosion type, a large number of DFA states are generated while the NFA is converted into the DFA, and a large amount of cache space is needed to record them, so the space complexity is extremely high.
3. Specific state-explosion structures cannot be identified: for some regular expressions the number of DFA states does not reach the threshold, so this technique does not identify them as state-explosion-type regular expressions, even though they contain a specific state-explosion structure and should be classified as such.
Disclosure of Invention
In view of the technical problems in the prior art, the invention aims to provide a method for identifying state-explosion-type regular expressions that can process regular expressions in batches efficiently and quickly and meets the requirements of online systems for high processing performance and low space consumption. The main idea of the invention is to use a graph neural network (GNN) model to automatically learn the structural features of the NFA graphs (generated from the regular expressions by the Thompson construction), embed the high-dimensional graph representation into a low-dimensional vector space, and then use a classification model to classify the vectorized graphs, i.e., to decide whether each NFA corresponds to a state-explosion-type regular expression. The details are described below.
The technical scheme of the invention is as follows:
A method for identifying a state-explosion-type regular expression, comprising the following steps:
1) generating the NFA (nondeterministic finite automaton) graph corresponding to each regular expression to be identified, to obtain the set of NFA graphs corresponding to the regular expressions;
2) for each NFA graph in the set, extracting all rooted subgraphs of the NFA graph, inputting them into a graph2vec model, and training the model to obtain an embedded representation of the NFA graph;
3) processing the embedded representation of the NFA graph with a classification model to determine whether the regular expression is a state-explosion-type regular expression.
Further, in step 1), the NFA graph corresponding to the regular expression is generated according to the Thompson construction, and the information of the NFA graph is stored in a gexf file. The information of the NFA graph comprises node information and edge information: the node information includes the id value of each node, and the edge information includes the start node, the end node, and the transition character of each edge.
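As a rough illustration of the storage step above, the sketch below serializes a small NFA graph to GEXF using only the Python standard library. The function name `nfa_to_gexf` and the tuple layout of the edges are hypothetical; storing the transition character as its ASCII code in the `weight` attribute follows the description in the embodiment, and a character class such as [^A] would need an extra convention that this sketch does not cover.

```python
import xml.etree.ElementTree as ET

def nfa_to_gexf(num_nodes, edges):
    """Serialize an NFA graph as a GEXF 1.2 string.

    edges is an iterable of (source, target, char) tuples; the transition
    character is written as its ASCII code in the edge's weight attribute.
    """
    gexf = ET.Element(
        "gexf", {"xmlns": "http://www.gexf.net/1.2draft", "version": "1.2"}
    )
    graph = ET.SubElement(gexf, "graph", {"defaultedgetype": "directed"})
    nodes_el = ET.SubElement(graph, "nodes")
    for n in range(num_nodes):
        ET.SubElement(nodes_el, "node", {"id": str(n), "label": str(n)})
    edges_el = ET.SubElement(graph, "edges")
    for i, (src, dst, ch) in enumerate(edges):
        ET.SubElement(
            edges_el,
            "edge",
            {"id": str(i), "source": str(src), "target": str(dst),
             "weight": str(ord(ch))},
        )
    return ET.tostring(gexf, encoding="unicode")

# First three states of an NFA that reads "T" and then "R".
print(nfa_to_gexf(3, [(0, 1, "T"), (1, 2, "R")]))
```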
Further, according to the outgoing-edge character sets of the nodes, the nodes are divided into three categories, which serve as the feature information of the nodes: nodes that accelerate state explosion, nodes that hinder state explosion, and ordinary nodes. A node n is marked as a node that accelerates state explosion if its outgoing-edge character set overlaps with the prefix of the regular expression and that character set occurs k times consecutively. A node m is marked as a node that hinders state explosion if there is a node n that accelerates state explosion, m precedes n, and the intersection of the outgoing-edge character sets of m and n is empty. All other nodes are marked as ordinary nodes.
Further, the feature information of the nodes is added to the WL algorithm, and all rooted subgraphs in the gexf file are then extracted using the WL algorithm fused with the node classification.
Further, the classification model is an SVM classifier.
Further, the graph2vec model is trained as follows: all rooted subgraphs in the NFA graph are treated as the vocabulary of a document, and a vectorized representation of the document is obtained by maximizing
$$\sum_{j} \log \Pr(w_j \mid d_i)$$
The optimization goal is to maximize the probability that the rooted subgraphs belonging to each graph occur in that graph, i.e., to maximize
$$\sum_{n \in G} \log \Pr\left(sg_n^{(d)} \mid \Phi(G)\right)$$
where $d_i$ is the vectorized representation of the i-th document, $w_j$ is the embedded representation of the j-th word of the vocabulary that appears in the i-th document, $\Pr(w_j \mid d_i)$ is the conditional probability that word $w_j$ occurs in document $d_i$, $\Phi(G)$ is the vectorized representation of the graph G for which an embedded representation is desired, and $sg_n^{(d)}$ is the vectorized representation of the degree-d rooted subgraph rooted at node n in graph G.
A system for identifying state-explosion-type regular expressions, comprising an NFA graph conversion module, a rooted subgraph extraction module, an NFA graph embedded representation generation module, and a classification model, wherein:
the NFA graph conversion module is used for generating the NFA graph corresponding to each regular expression, to obtain the set of NFA graphs corresponding to the regular expressions;
the rooted subgraph extraction module is used for extracting all rooted subgraphs of an NFA graph;
the NFA graph embedded representation generation module is used for training a graph2vec model on all rooted subgraphs of the NFA graph to obtain the embedded representation of the NFA graph;
the classification model is used for determining, from the embedded representation of the NFA graph of a regular expression, whether the regular expression is a state-explosion-type regular expression.
According to the invention, a graph2vec model is first selected as the base model; the characteristics of NFA graphs are then taken into account, and the edge information in the NFA is added to optimize the model input, so as to achieve more accurate classification. The key technical schemes of the invention are introduced as follows:
1. graph2vec model introduction
The selected base model is graph2vec, a graph neural network framework for unsupervised representation learning. It learns embedded representations of undirected, directed and weighted graphs, and these representations can serve as input data for subsequent tasks such as graph classification. The specific design of the graph2vec model is as follows:
(1) Extracting rooted subgraphs: first, the WL algorithm proposed by Shervashidze (shown as Algorithm 1) is used to extract the degree-d rooted subgraph of each node in an NFA graph, and a unique label is assigned to every rooted subgraph. Each NFA graph can thus be regarded as composed of a set of rooted subgraphs; if each graph is regarded as a document, the rooted subgraphs of the graph correspond to the words that make up the document, as shown in FIG. 1.
Figure BDA0003158612680000041
(2) Training: the rooted subgraphs of all graphs obtained by the WL algorithm are regarded as the vocabulary of the documents, and the idea of the doc2vec model is applied: the more words two documents share, the closer their representations should be in the low-dimensional vector space. The training goal of the model is to maximize the conditional probability of the words that appear in a document and minimize the conditional probability of the words that do not, i.e., to obtain a vectorized representation of each document by maximizing equation (1), where $d_i$ is the vectorized representation of the i-th document, $w_j$ is the embedded representation of the j-th word of the vocabulary that appears in the i-th document, and $\Pr(w_j \mid d_i)$ is the conditional probability that word $w_j$ occurs in document $d_i$.
$$\sum_{j} \log \Pr(w_j \mid d_i) \qquad (1)$$
If two graphs are composed of more similar rooted subgraphs, the graph2vec model likewise expects their low-dimensional embedded representations to be closer together, so its optimization goal is to maximize the probability that the rooted subgraphs belonging to each graph occur in that graph, as in equation (2), where $\Phi(G)$ is the vectorized representation of the graph G (the NFA graph generated from a regular expression) for which an embedded representation is desired, and $sg_n^{(d)}$ is the vectorized representation of the degree-d rooted subgraph rooted at node n in graph G.
$$\sum_{n \in G} \log \Pr\left(sg_n^{(d)} \mid \Phi(G)\right) \qquad (2)$$
During model training, equation (2) is maximized using the idea of the skipgram model with negative sampling, and stochastic gradient descent is used as the optimization algorithm to obtain the final graph embeddings.
(3) Use: once the embedded representation of each graph has been obtained, it can be used for subsequent classification or clustering tasks; for example, an SVM can be used for binary classification of the graphs, or K-means can be used to cluster them.
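The training step (2) above, skipgram with negative sampling optimized by stochastic gradient descent, can be sketched in plain Python as follows. This is a simplified illustration rather than the patent's implementation: the function name `train_graph_embeddings`, the hyperparameters, and the uniform choice of negative samples are all assumptions, and each "document" is simply the list of rooted-subgraph labels of one graph.

```python
import math
import random

def train_graph_embeddings(docs, dim=8, epochs=50, lr=0.05, neg=3, seed=0):
    """Learn one vector per graph from its rooted-subgraph labels.

    docs[g] is the list of rooted-subgraph labels ("words") of graph g.
    Negative samples are drawn uniformly from the vocabulary (an assumption).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    graph_vecs = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in docs]
    word_vecs = [[0.0] * dim for _ in vocab]

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, x))))

    for _ in range(epochs):
        for g, doc in enumerate(docs):
            for w in doc:
                # One positive target plus `neg` random negative targets.
                targets = [(index[w], 1.0)]
                targets += [(rng.randrange(len(vocab)), 0.0) for _ in range(neg)]
                grad_g = [0.0] * dim
                for t, label in targets:
                    if label == 0.0 and t == index[w]:
                        continue  # skip a negative that equals the positive
                    score = sigmoid(sum(a * b for a, b in zip(graph_vecs[g], word_vecs[t])))
                    step = lr * (label - score)
                    for k in range(dim):
                        grad_g[k] += step * word_vecs[t][k]
                        word_vecs[t][k] += step * graph_vecs[g][k]
                for k in range(dim):
                    graph_vecs[g][k] += grad_g[k]
    return graph_vecs

docs = [["1+0", "1+3", "2+1"], ["1+0", "1+3", "2+2"], ["1+7", "1+8", "2+9"]]
embeddings = train_graph_embeddings(docs)
print(len(embeddings), len(embeddings[0]))
```

Graphs that share many rooted-subgraph labels receive gradient updates toward the same word vectors, which is the mechanism that is expected to pull their embeddings together.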
2. Optimizing model data input
The invention predicts the output label of each NFA graph sample. In the set of NFA graphs generated from the regular expressions to be processed, an NFA graph has not only a node-and-edge structure; the transition information on its edges carries further information about state explosion. Processing of the edge information is therefore added on top of the graph2vec model. The design idea is to divide the nodes into three categories, used as node feature information, by analyzing the outgoing-edge character set of each node: nodes that accelerate state explosion, nodes that hinder state explosion, and ordinary nodes.
(1) Nodes that accelerate state explosion: if the outgoing-edge character set of node n overlaps with the prefix of the regular expression and that character set occurs k times consecutively (k > 1), node n causes the number of DFA states corresponding to the regular expression to grow rapidly, and node n is marked as a node that accelerates state explosion.
(2) Nodes that hinder state explosion: if there is a node n that accelerates state explosion as described above, a node m precedes node n, and the intersection of the outgoing-edge character sets of m and n is empty, then node m causes the number of DFA states corresponding to the regular expression to grow slowly, and node m is marked as a node that hinders state explosion.
(3) Ordinary nodes: all nodes other than the above two types are marked as ordinary nodes.
After the nodes of the NFA have been classified into these three types, the node classification feature can be added to the WL algorithm, so that the resulting rooted subgraphs contain not only the structural information of the graph but also the information on its edges; graph2vec training then proceeds as before to obtain the embedded representation of the NFA.
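The three categories above can be illustrated with a small sketch. The patent does not spell out the exact procedure, so several points here are assumptions: the NFA is treated as a linear chain of nodes, the prefix is taken to be the first node's outgoing-edge character set, every node in a qualifying run is labeled as accelerating, and only the node immediately before the run is checked for the hindering condition. The function name `classify_nodes` and the parameter `k_min` are hypothetical.

```python
import string

def classify_nodes(out_sets, k_min=2):
    """Label nodes 1 (accelerating), 2 (hindering) or 0 (ordinary).

    out_sets[i] is the outgoing-edge character set of node i in a linear NFA;
    the final node has an empty set. k_min=2 encodes the condition k > 1.
    """
    labels = [0] * len(out_sets)
    prefix = out_sets[0]  # assumption: prefix = first character set
    i = 0
    while i < len(out_sets):
        # Find the run of consecutive nodes sharing the same character set.
        j = i
        while j + 1 < len(out_sets) and out_sets[j + 1] == out_sets[i]:
            j += 1
        run_length = j - i + 1
        if out_sets[i] and run_length >= k_min and (out_sets[i] & prefix):
            for n in range(i, j + 1):
                labels[n] = 1
            # The preceding node hinders explosion if its set is disjoint.
            if i > 0 and labels[i - 1] == 0 and not (out_sets[i - 1] & out_sets[i]):
                labels[i - 1] = 2
        i = j + 1
    return labels

# Linear NFA for TRA[^A]{3}HL: nine nodes, eight transitions.
NOT_A = frozenset(string.printable) - {"A"}
out_sets = [frozenset("T"), frozenset("R"), frozenset("A"),
            NOT_A, NOT_A, NOT_A, frozenset("H"), frozenset("L"), frozenset()]
print(classify_nodes(out_sets))
```

Under these assumptions node 3 receives label 1 and node 2 receives label 2, in agreement with the labels given for these two nodes in the embodiment; whether nodes 4 and 5 should also carry label 1 is not stated in the source.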
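A minimal sketch of WL relabeling seeded with the node categories is given below. It assumes the relabeling rule described in the embodiment (a node's previous label followed by the previous labels of its out-neighbors, mapped to a new label) and emits labels in the "a+b" form used there; the function name `wl_rooted_subgraphs`, the per-round lookup tables, and the sorting of neighbor labels are illustrative choices. Passing the same `vocab` dictionary for several graphs keeps labels consistent across graphs.

```python
def wl_rooted_subgraphs(succ, init_labels, d, vocab=None):
    """Return, for every node, its rooted-subgraph labels for rounds 0..d.

    succ maps node -> list of out-neighbors; init_labels maps node -> 0/1/2
    (the node category). vocab maps round -> {signature: label id}.
    """
    vocab = {} if vocab is None else vocab
    labels = dict(init_labels)
    words = {n: [f"0+{labels[n]}"] for n in succ}
    for a in range(1, d + 1):
        table = vocab.setdefault(a, {})
        new_labels = {}
        for n in sorted(succ):
            # Signature: own previous label, then the out-neighbors' labels.
            signature = (labels[n], tuple(sorted(labels[m] for m in succ[n])))
            new_labels[n] = table.setdefault(signature, len(table))
            words[n].append(f"{a}+{new_labels[n]}")
        labels = new_labels
    return words

# Linear 9-node NFA for TRA[^A]{3}HL with assumed round-0 categories.
succ = {i: [i + 1] for i in range(8)}
succ[8] = []
init = dict(enumerate([0, 0, 2, 1, 1, 1, 0, 0, 0]))
for node, labels in wl_rooted_subgraphs(succ, init, d=3).items():
    print(node, labels)
```

The list of labels collected for all nodes of a graph is what plays the role of the graph's "document" in the subsequent embedding step.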
Compared with the prior art, the invention has the following positive effects:
An NFA is a graph composed of edges and nodes, and certain special structures can cause the generated DFA to suffer state explosion. Therefore, to classify regular expressions, the invention converts each of them into a graph structure, solves the graph-related task in an end-to-end manner using a graph neural network (GNN), and extracts features to obtain a low-dimensional embedded representation of each NFA graph, so that graphs with similar structures are also close in the vector space. This makes it possible to classify the regular expressions by the features of their graph structure.
In implementing the model, node features are added according to the edge information, and each complete NFA graph is decomposed into a set of subgraphs, which further ensures that graphs with similar substructures receive the same classification result. These two key points are described in detail below.
Adding edge information:
The NFA graph corresponding to each regular expression is generated according to the Thompson construction, and the information on the edges of the graph determines whether the generated DFA undergoes state explosion. Even if two NFA graphs are isomorphic, they may have different labels if the transition information of two corresponding edges differs. The edge information therefore needs to be added to the model, and the way in which it is added is important: it must fully reflect the characteristics of state-explosion-type regular expressions. Based on a survey of the relevant literature and on experimental verification, the model divides the nodes into three classes by analyzing their edge information and adds these classes to the WL algorithm as node feature information, so that the extracted rooted subgraphs contain the edge-related features that cause state explosion.
Extracting rooted subgraphs:
If two NFA graphs are composed of many identical subgraphs (including the information on their edges), they have the same label with high probability, and experiments verify that the NFAs corresponding to most state-explosion-type regular expressions have similar substructures. Therefore, during model training, the WL algorithm is used to extract the rooted subgraphs of all graphs, and each graph is regarded as composed of a set of rooted subgraphs. The resulting graph embeddings carry more information about the subgraphs each graph contains, so that NFAs with similar subgraphs are also closer in the embedding space.
Specifically, for a regular expression (from a rule library such as snort or l7), the process of determining whether it generates a state-exploded DFA is as follows: first, the NFA graph corresponding to the regular expression is generated according to the Thompson construction and the information of the graph is stored in a gexf file; all rooted subgraphs of the gexf graph are then extracted using the WL algorithm fused with the node classification; the graph2vec model is trained to obtain the embedded representation of the graph; finally, the embedded representation is processed by an SVM classifier to obtain the final label: 1 (explosive) or -1 (non-explosive).
The hardware configuration of the experiment of the invention is shown in table 2:
Table 2. Hardware configuration of the experiment
Operating system: CentOS Linux release 7.5.1804
Memory: 125 GB
CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz
Experiment design: 3520 regular expressions were selected from the rule libraries clav_regex_strings_2015, l7_rules_2015 and snort_regex_rules_2015 and converted into the corresponding NFA graphs by the Thompson construction, and the model was trained to obtain the embedded representation of each graph. Because the SVM is a supervised learning model, each regular expression was labeled using the threshold method for the SVM binary classification; as a result, among the 3520 regular expressions, 1570 were labeled non-explosive (-1) and 1720 were labeled explosive (1). During training, 70% (3168) of the NFA embedded representations were randomly selected as the training set, and the remaining 30% (352) were used as the test set. With d set to 3, 5 and 8 respectively, the experimental results are shown in Table 3:
TABLE 3 comparison of the results
Figure BDA0003158612680000061
Figure BDA0003158612680000071
The experimental results show that the highest classification accuracy, 98%, is achieved when d is set to 5, i.e., when 5-hop neighbor information is aggregated.
Drawings
FIG. 1 is a comparison of the concept of the Graph2vec model and the concept of the Doc2vec model;
(a) a Doc2vec model idea, (b) a Graph2vec model idea;
FIG. 2 is a flow chart of the method of the present invention;
fig. 3 is an example of SVM binary classification.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The original input to the invention is a set of regular expressions, such as the regular expression "TRA[^A]{3}HL" shown in Table 4. The process of predicting whether a regular expression is of the state explosion type, shown in FIG. 2, is divided into the following stages:
(1) First, the NFA graph corresponding to each regular expression is generated using the Thompson construction; Table 4 shows the NFA graph corresponding to the regular expression "TRA[^A]{3}HL".
Table 4. Example of the execution of the WL algorithm
Figure BDA0003158612680000072
(2) Converting the NFA to standard graph-format data: after each regular expression has been converted into its NFA graph by the Thompson construction, a gexf file is used to record the NFA graph information. For example, the gexf file that stores the NFA graph of the regular expression "TRA[^A]{3}HL" in Table 4 contains the node information and edge information of the NFA graph, where the node information includes the id value of each node, and the edge information includes the start node (source) and end node (target) of each edge and the transition character on the edge (the edge character, stored as weight and represented by its ASCII code value):
Figure BDA0003158612680000081
(3) Extracting all rooted subgraphs of the NFA: to illustrate how rooted subgraphs are extracted after the edge information has been added, set d = 3, i.e., aggregate the nodes within a 3-hop neighborhood. Taking the NFA graph in Table 4 as an example, the WL algorithm extracts the degree-3 rooted subgraph of each node in the graph. In a label "a+b", the number a denotes the a-th iteration, which also means that information within the a-hop neighborhood has been aggregated, and b denotes the rooted-subgraph label obtained by the aggregation. Round 0 records the feature information of each node, i.e., the node classification derived from its edge information: "1" denotes a node that accelerates state explosion, "2" a node that hinders state explosion, and "0" an ordinary node. For example, node 3 is labeled "1" because its outgoing-edge character set is [^A] (all characters except A), which occurs 3 times consecutively and overlaps with the prefix T in the character T, so the node accelerates state explosion. Node 2 is labeled "2" because the intersection of its outgoing-edge character A with the following character set [^A] is empty, which hinders state explosion. In each subsequent round, the first number in parentheses is the node's rooted-subgraph label from the previous round, followed by the previous-round rooted-subgraph labels of all its out-neighbors; the combined sequence is mapped to a new label, which is the node's rooted-subgraph label for the current round. After three rounds, the degree-3 rooted subgraphs of all nodes are obtained, so the NFA in Table 4 consists of 9 degree-3 rooted subgraphs.
(4) Obtaining the embedded representation of the NFA graph: after the WL algorithm has been executed on each NFA graph, the set of all rooted subgraphs of that graph is obtained, and each NFA graph is regarded as composed of its set of rooted subgraphs. The correspondence between rooted subgraphs and NFA graphs (i.e., which graph each rooted subgraph belongs to) is input into the skipgram model, and training yields the embedded representation of each NFA graph; that is, the high-dimensional graph-structured data is represented by a low-dimensional vector of dimension 1 × 1024 (the dimension is set manually).
(5) Classification prediction: after the NFA graphs have been embedded into the low-dimensional vector space, the vectorized NFAs (i.e., the embedded representations of the NFA graphs obtained in the previous step) can be classified using a classification model, for example the SVM (support vector machine) binary classification method adopted in the invention. The embedded representation of an NFA graph is input directly into the classification model to obtain the category of the corresponding regular expression (i.e., whether it is a state-explosion-type regular expression). The basic idea of the SVM is to find the separating hyperplane that correctly divides the training data set and has the largest geometric margin; the points in the two half-spaces separated by the hyperplane correspond to the two classification results, as shown in FIG. 3.
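The classification stage can be illustrated with a small stand-in for an SVM: a linear classifier trained by subgradient descent on the regularized hinge loss. This is not the classifier used in the experiments, whose kernel and parameters are not stated; the function names `train_linear_svm` and `predict`, the hyperparameters, and the two-dimensional toy embeddings are assumptions for illustration only.

```python
def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=100):
    """Train a linear SVM (hinge loss + L2 penalty) by subgradient descent.

    X is a list of embedding vectors, y a list of labels in {1, -1}
    (1 = explosive, -1 = non-explosive).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
            # L2 regularization shrinks the weights at every step.
            w = [wk * (1.0 - lr * lam) for wk in w]
            if margin < 1.0:
                # Sample violates the margin: move the hyperplane toward it.
                w = [wk + lr * yi * xk for wk, xk in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    """Return 1 (explosive) or -1 (non-explosive) for embedding x."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= 0 else -1

# Toy 2-dimensional "embeddings"; real embeddings would have 1024 dimensions.
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```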
The foregoing is merely a preferred embodiment of the present invention, and it should be understood that various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention. The present invention should not be limited to the disclosure of the embodiments and drawings in the specification, and the scope of the present invention is defined by the scope of the claims.

Claims (9)

1. A method for identifying a state-explosion-type regular expression, comprising the following steps:
1) generating the NFA (nondeterministic finite automaton) graph corresponding to each regular expression to be identified, to obtain the set of NFA graphs corresponding to the regular expressions;
2) for each NFA graph in the set, extracting all rooted subgraphs of the NFA graph, inputting them into a graph2vec model, and training the model to obtain an embedded representation of the NFA graph;
3) processing the embedded representation of the NFA graph with a classification model to determine whether the regular expression is a state-explosion-type regular expression.
2. The method according to claim 1, wherein in step 1), the NFA graph corresponding to the regular expression is generated according to the Thompson construction method, and the information of the NFA graph is stored in a gexf file; the information of the NFA graph includes node information and edge information, wherein the node information includes the id value of each node, and the edge information includes the start node, the end node, and the transition character of each edge.
3. The method according to claim 2, characterized in that the nodes are classified into three categories according to their edge character sets, and the category serves as the characteristic information of the node: state-explosion-accelerating nodes, state-explosion-blocking nodes, and common nodes; wherein a state-explosion-accelerating node is defined as follows: if the edge character set of a node n overlaps with the prefix of the regular expression and the edge character set appears k times consecutively, node n is marked as a state-explosion-accelerating node; a state-explosion-blocking node is defined as follows: if a state-explosion-accelerating node n exists, a node m exists before node n, and the intersection of the outgoing edge character set of node m and the outgoing edge character set of node n is the empty set, node m is marked as a state-explosion-blocking node; and all nodes other than the state-explosion-accelerating nodes and the state-explosion-blocking nodes are marked as common nodes.
4. The method of claim 3, wherein the characteristic information of the nodes is added to the WL (Weisfeiler-Lehman) algorithm, and all root subgraphs in the gexf file are then extracted using the WL algorithm that incorporates the node classification.
5. The method of claim 1, wherein the classification model is an SVM classifier.
6. The method of claim 1, wherein the graph2vec model is trained as follows: all root subgraphs in the NFA graph are treated as the vocabulary of a document, and a vectorized representation of the document is obtained by maximizing the formula
Figure FDA0003158612670000011
the optimization objective being to maximize the probability that the root subgraphs belonging to each graph appear in that graph, i.e., to maximize
Figure FDA0003158612670000012
where d_i is the vectorized representation of the i-th document, w_j is the embedded representation of the j-th vocabulary word appearing in the i-th document, Pr(w_j|d_i) is the conditional probability that word w_j occurs in document d_i, Φ(G) is the vectorized representation of the graph G for which an embedded representation is sought, and
Figure FDA0003158612670000013
is the vectorized representation of the degree-d root subgraph rooted at node n in graph G.
7. A system for identifying a state explosion type regular expression, characterized by comprising an NFA graph conversion module, a root subgraph extraction module, an NFA graph embedded representation generation module, and a classification model; wherein
the NFA graph conversion module is used for generating an NFA graph corresponding to the regular expression to obtain an NFA graph set corresponding to the regular expression;
the root subgraph extraction module is used for extracting all root subgraphs in the NFA graph;
the embedded representation generation module of the NFA graph is used for training all root subgraphs in the NFA graph by using a graph2vec model to obtain the embedded representation of the NFA graph;
the classification model is used for judging whether the regular expression is a state explosion type regular expression according to the embedded representation of the NFA graph of the regular expression.
8. The system of claim 7, wherein the NFA graph conversion module generates the NFA graph corresponding to the regular expression according to the Thompson construction method, and stores the information of the NFA graph in a gexf file; the information of the NFA graph includes node information and edge information, wherein the node information includes the id value of each node, and the edge information includes the start node, the end node, and the transition character of each edge.
9. The system of claim 8, wherein the nodes are classified into three categories according to their edge character sets, and the category serves as the characteristic information of the node: state-explosion-accelerating nodes, state-explosion-blocking nodes, and common nodes; wherein a state-explosion-accelerating node is defined as follows: if the edge character set of a node n overlaps with the prefix of the regular expression and the edge character set appears k times consecutively, node n is marked as a state-explosion-accelerating node; a state-explosion-blocking node is defined as follows: if a state-explosion-accelerating node n exists, a node m exists before node n, and the intersection of the outgoing edge character set of node m and the outgoing edge character set of node n is the empty set, node m is marked as a state-explosion-blocking node; and all nodes other than the state-explosion-accelerating nodes and the state-explosion-blocking nodes are marked as common nodes.
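The root subgraph extraction described in claims 3, 4 and 9 can be sketched as WL relabeling seeded with the node categories. The Python below is only an illustration under stated assumptions: the function name wl_root_subgraphs, the category codes "ACC", "BLK" and "COM", the use of successor nodes and transition characters as the neighbourhood signature, and the toy three-node graph with hand-assigned categories are all hypothetical and are not taken from the claims.

```python
# Sketch of WL relabeling seeded with node categories (hypothetical names).

def wl_root_subgraphs(edges, node_class, depth):
    """Return {node: [label_0, ..., label_depth]}.

    edges:      list of (src, dst, char) tuples describing an NFA graph.
    node_class: {node: "ACC" | "BLK" | "COM"}, used as the initial WL label.
    label_h of node v summarises the degree-h root subgraph rooted at v.
    """
    succ = {v: [] for v in node_class}
    for src, dst, char in edges:
        succ[src].append((char, dst))

    labels = {v: [node_class[v]] for v in node_class}
    for h in range(1, depth + 1):
        for v in node_class:
            # Signature of each successor: transition character plus the
            # successor's label from the previous iteration (index h - 1).
            neigh = sorted(f"{c}>{labels[u][h - 1]}" for c, u in succ[v])
            labels[v].append(f"{labels[v][h - 1]}({','.join(neigh)})")
    return labels


# Toy NFA graph: 0 -a-> 1, 1 -.-> 1 (self loop), 1 -b-> 2.
toy_edges = [(0, 1, "a"), (1, 1, "."), (1, 2, "b")]
# Categories assigned by hand for illustration only.
toy_classes = {0: "COM", 1: "ACC", 2: "COM"}

subgraph_labels = wl_root_subgraphs(toy_edges, toy_classes, depth=2)

# The "document" for this graph: every root subgraph label acts as one word.
document = [lab for v in sorted(subgraph_labels) for lab in subgraph_labels[v]]
```

Nodes 0 and 2 start with the same category label but receive different labels after one iteration, because their outgoing neighbourhoods differ. A list such as document is the kind of input a doc2vec-style model would consume when learning the graph embedding.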
CN202110784458.0A 2021-07-12 2021-07-12 Method and system for identifying state explosion type regular expression Active CN113627164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110784458.0A CN113627164B (en) 2021-07-12 2021-07-12 Method and system for identifying state explosion type regular expression

Publications (2)

Publication Number Publication Date
CN113627164A (en) 2021-11-09
CN113627164B (en) 2024-03-01

Family

ID=78379500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110784458.0A Active CN113627164B (en) 2021-07-12 2021-07-12 Method and system for identifying state explosion type regular expression

Country Status (1)

Country Link
CN (1) CN113627164B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101827084A (en) * 2009-01-28 2010-09-08 丛林网络公司 The application identification efficiently of the network equipment
CN103259793A (en) * 2013-05-02 2013-08-21 东北大学 Method for inspecting deep packets based on suffix automaton regular engine structure
US20150355891A1 (en) * 2014-06-10 2015-12-10 International Business Machines Corporation Computer-based distribution of large sets of regular expressions to a fixed number of state machine engines for products and services
US20160275205A1 (en) * 2013-12-05 2016-09-22 Hewlett Packard Enterprise Development Lp Regular expression matching
CN109800337A (en) * 2018-12-06 2019-05-24 成都网安科技发展有限公司 A kind of multi-mode canonical matching algorithm suitable for big alphabet

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王翔; 卢毓海; 马伟; 刘燕兵: "A regular expression matching method for DFA state explosion" (一种针对DFA状态爆炸的正则表达式匹配方法), 计算机工程 (Computer Engineering), vol. 45, no. 04, pages 148-156 *

Also Published As

Publication number Publication date
CN113627164B (en) 2024-03-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant