CN113627164A - Method and system for identifying state explosion type regular expression
- Publication number
- CN113627164A (application CN202110784458.0A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a method and a system for identifying state explosion type regular expressions. The method comprises the following steps: 1) generating the NFA (nondeterministic finite automaton) graph corresponding to each regular expression to be identified, to obtain the NFA graph set corresponding to the regular expressions; 2) for each NFA graph in the NFA graph set, extracting all rooted subgraphs of the NFA graph, inputting the rooted subgraphs into a graph2vec model, and training to obtain an embedded representation of the NFA graph; 3) processing the embedded representation of the NFA graph with a classification model to judge whether the regular expression is a state explosion type regular expression. The method can process regular expressions in batches efficiently and quickly, and meets an online system's requirements for high processing performance and low space consumption.
Description
Technical Field
The invention relates to a method and a system for identifying a state explosion type regular expression, and belongs to the technical field of computer software.
Background
Regular expression matching is a key component of many applications such as network filtering. In Deep Packet Inspection (DPI), for example, it strengthens the security of network communications by detecting malicious traffic. Before matching, a regular expression is first converted into a finite automaton (FA), a state machine that recognizes the same language as the regular expression. Depending on whether the next state transition is uniquely determined, an FA is either a nondeterministic finite automaton (NFA) or a deterministic finite automaton (DFA). NFAs and DFAs have equivalent expressive power, but because each DFA state corresponds to a set of NFA states, converting an NFA into a DFA may dramatically increase the number of states, a phenomenon known as state explosion. Table 1 compares the space complexity and the matching time complexity of NFAs and DFAs under different coding strategies.
Table 1. Space complexity and matching time complexity of NFA and DFA under different coding strategies
DFAs are widely used in DPI applications because of their efficient matching performance, but DFA state explosion poses significant challenges to their practical use. The existing technique for identifying whether a regular expression causes DFA state explosion simply sets a threshold on the number of DFA states: if the number of states exceeds the threshold, the regular expression is judged to be of the state explosion type; otherwise it is not. Specifically, the threshold is applied while the DFA is generated from the regular expression: the regular expression is first parsed into a parse tree, the parse tree is converted into an NFA by the Thompson or Glushkov construction, and the NFA is finally converted into a DFA by subset construction. If the number of states of the generated DFA exceeds the threshold, the DFA is judged to have undergone state explosion and the corresponding regular expression is of the state explosion type; otherwise it is of the non-state-explosion type.
The existing technical scheme mainly identifies state explosion type regular expressions by setting a threshold, and has the following defects:
1. Low practicality: to judge whether a regular expression is of the state explosion type, the complete process of generating a DFA from the regular expression must be executed. The operation is complex, the algorithm is slow, and execution is difficult.
2. High space complexity: if the regular expression is of the state explosion type, a large number of DFA states are generated while converting the NFA into the DFA, and a large amount of cache space is needed to record them, so the space complexity is extremely high.
3. Specific state explosion structures cannot be identified: for some regular expressions the number of DFA states does not reach the threshold, so this technique cannot identify them as state explosion type regular expressions, yet they contain a specific state explosion structure and should be classified as the state explosion type.
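The threshold-based baseline criticized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the NFA is hand-built and epsilon-free (not produced by a Thompson or Glushkov construction), and the names `count_dfa_states` and `nfa_nth_from_end` as well as the threshold value of 100 are hypothetical choices, not taken from the patent.

```python
from collections import deque

def count_dfa_states(nfa, start, alphabet, threshold):
    """Subset construction that stops once more than `threshold` DFA states exist.

    `nfa` maps (state, symbol) -> set of next states (epsilon-free for brevity).
    Returns (number_of_dfa_states_found, exploded_flag).
    """
    start_set = frozenset([start])
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # A DFA state is the set of NFA states reachable on `symbol`.
            nxt = frozenset(t for s in current for t in nfa.get((s, symbol), ()))
            if nxt and nxt not in seen:
                seen.add(nxt)
                if len(seen) > threshold:
                    return len(seen), True
                queue.append(nxt)
    return len(seen), False

def nfa_nth_from_end(n):
    """NFA for (a|b)*a(a|b){n-1}: the n-th symbol from the end is 'a'."""
    nfa = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, n):
        nfa[(i, "a")] = {i + 1}
        nfa[(i, "b")] = {i + 1}
    return nfa

print(count_dfa_states(nfa_nth_from_end(3), 0, "ab", threshold=100))   # (8, False)
print(count_dfa_states(nfa_nth_from_end(10), 0, "ab", threshold=100))  # (101, True)
```

The second call shows the cost the text describes: the verdict is only known after more than `threshold` subsets have been materialized and stored.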
Disclosure of Invention
In view of the technical problems in the prior art, the invention aims to provide a method for identifying state explosion type regular expressions that can process regular expressions in batches efficiently and quickly and meets an online system's requirements for high processing performance and low space consumption. The main idea of the invention is to use a graph neural network (GNN) model to automatically learn the structural features of the NFA graph (generated from the regular expression by the Thompson construction), embed the high-dimensional graph representation into a low-dimensional vector space, and then use a classification model to classify the vectorized graphs, that is, to decide whether each NFA corresponds to a state explosion type regular expression. The details are described below.
The technical scheme of the invention is as follows:
A method for identifying a state explosion type regular expression comprises the following steps:
1) generating the NFA (nondeterministic finite automaton) graph corresponding to each regular expression to be identified, to obtain the NFA graph set corresponding to the regular expressions;
2) for each NFA graph in the NFA graph set, extracting all rooted subgraphs of the NFA graph, inputting the rooted subgraphs into a graph2vec model, and training to obtain an embedded representation of the NFA graph;
3) processing the embedded representation of the NFA graph with a classification model to judge whether the regular expression is a state explosion type regular expression.
Further, in step 1), the NFA graph corresponding to the regular expression is generated by the Thompson construction, and the information of the NFA graph is stored in a gexf file. The information of the NFA graph includes node information and edge information: the node information includes the id value of each node, and the edge information includes the start node, the end node, and the transition character of each edge.
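As a minimal sketch of the storage format just described, the following Python writes node ids and edges (start node, end node, and the transition character stored as an ASCII-code weight) into GEXF using only the standard library. The function name `nfa_to_gexf`, the choice of the GEXF 1.2 namespace, and the hand-written three-edge NFA fragment are assumptions for illustration; the source does not say which tool produces its gexf files.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"  # assumed GEXF version

def nfa_to_gexf(num_nodes, edges):
    """Serialize an NFA as a GEXF string.

    `edges` is a list of (source, target, char); the character is written
    to the edge `weight` attribute as its ASCII code.
    """
    gexf = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(gexf, "graph", {"defaultedgetype": "directed"})
    nodes_el = ET.SubElement(graph, "nodes")
    for i in range(num_nodes):
        ET.SubElement(nodes_el, "node", {"id": str(i), "label": str(i)})
    edges_el = ET.SubElement(graph, "edges")
    for k, (src, dst, ch) in enumerate(edges):
        ET.SubElement(edges_el, "edge", {
            "id": str(k),
            "source": str(src),
            "target": str(dst),
            "weight": str(ord(ch)),  # 'T' -> 84
        })
    return ET.tostring(gexf, encoding="unicode")

# First three transitions of an NFA for a pattern starting with "TRA".
xml_text = nfa_to_gexf(4, [(0, 1, "T"), (1, 2, "R"), (2, 3, "A")])
print(xml_text)
```

A character class such as [^A] does not fit into a single ASCII weight; one edge per character or an extra edge attribute would be needed, and the source does not specify which it uses.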
Further, according to its edge character set, each node is assigned to one of three categories, which serves as the node's feature information: nodes that accelerate state explosion, nodes that hinder state explosion, and ordinary nodes. A node that accelerates state explosion is defined as follows: if the edge character set of node n overlaps with the prefix of the regular expression and the edge character set appears k times consecutively, node n is marked as a node that accelerates state explosion. A node that hinders state explosion is defined as follows: if a node n that accelerates state explosion exists, a node m exists before node n, and the intersection of the outgoing-edge character set of node m and the outgoing-edge character set of node n is the empty set, node m is marked as a node that hinders state explosion. All remaining nodes are marked as ordinary nodes.
Further, the feature information of the nodes is added to the WL algorithm, and all rooted subgraphs in the gexf file are then extracted using the WL algorithm fused with the node classification.
Further, the classification model is an SVM classifier.
Further, the graph2vec model is trained as follows: all rooted subgraphs in the NFA graphs are treated as the vocabulary of documents. A vectorized representation of each document is derived by maximizing $\sum_{j} \log \Pr(w_j \mid d_i)$, and the optimization goal is to maximize the probability that each graph's own rooted subgraphs appear in it, i.e., to maximize $\sum_{n} \log \Pr(sg_n^{(d)} \mid \Phi(G))$, where $d_i$ is the vectorized representation of the i-th document, $w_j$ is the embedded representation of the j-th vocabulary word appearing in the i-th document, $\Pr(w_j \mid d_i)$ is the conditional probability that word $w_j$ appears in document $d_i$, $\Phi(G)$ is the vectorized representation of the graph G whose embedding is sought, and $sg_n^{(d)}$ is the vectorized representation of the d-degree rooted subgraph rooted at node n in graph G.
A system for identifying a state explosion type regular expression, characterized by comprising an NFA graph conversion module, a rooted subgraph extraction module, an NFA graph embedded representation generation module, and a classification model; wherein
the NFA graph conversion module is used for generating the NFA graph corresponding to each regular expression, to obtain the NFA graph set corresponding to the regular expressions;
the rooted subgraph extraction module is used for extracting all rooted subgraphs of the NFA graph;
the NFA graph embedded representation generation module is used for training on all rooted subgraphs of the NFA graph with a graph2vec model to obtain the embedded representation of the NFA graph;
the classification model is used for judging, from the embedded representation of the NFA graph of a regular expression, whether the regular expression is a state explosion type regular expression.
The invention selects the graph2vec model as the base model and, drawing on the characteristics of NFA graphs, adds the edge information of the NFA to optimize the model input, thereby achieving more accurate classification. The key technical scheme of the invention is introduced as follows:
1. graph2vec model introduction
The selected base model is graph2vec, a graph neural network framework for unsupervised representation learning. It learns embedded representations of undirected, directed, and weighted graphs, which can serve as input data for downstream tasks such as graph classification. The specific design of the graph2vec model is as follows:
(1) Extracting rooted subgraphs: first, the WL (Weisfeiler-Lehman) algorithm proposed by Shervashidze (shown as Algorithm 1) is used to extract the d-degree rooted subgraph of each node in an NFA graph, and a unique label is assigned to every rooted subgraph. Each NFA graph can thus be regarded as a set of rooted subgraphs: if each graph is regarded as a document, the rooted subgraphs in the graph correspond to the words that make up the document, as shown in FIG. 1.
(2) Training: the rooted subgraphs that the WL algorithm extracts from all graphs are treated as the vocabulary of documents, following the idea of the doc2vec model: the more words two documents share, the closer their representations should be in the low-dimensional vector space. The training goal is to maximize the conditional probability of words that appear in a document and minimize that of words that do not, i.e., a vectorized representation of each document is obtained by maximizing equation (1), $\sum_{j} \log \Pr(w_j \mid d_i)$, where $d_i$ is the vectorized representation of the i-th document, $w_j$ is the embedded representation of the j-th vocabulary word appearing in the i-th document, and $\Pr(w_j \mid d_i)$ is the conditional probability that word $w_j$ appears in document $d_i$.
Likewise, if two graphs are composed of more similar rooted subgraphs, graph2vec expects their low-dimensional embedded representations to be closer. Its optimization goal is therefore to maximize the probability that each graph's own rooted subgraphs appear in it, as in equation (2), $\sum_{n} \log \Pr(sg_n^{(d)} \mid \Phi(G))$, where $\Phi(G)$ is the vectorized representation of the graph G (the NFA graph generated from a regular expression) whose embedding is sought, and $sg_n^{(d)}$ is the vectorized representation of the d-degree rooted subgraph of G rooted at node n.
During training, equation (2) is maximized following the idea of the skip-gram model, using negative sampling and stochastic gradient descent, to obtain the final graph embeddings.
(3) Use: once the embedded representation of each graph is obtained, it can be used for downstream classification or clustering tasks; for example, an SVM can perform binary classification of the graphs, or K-means can cluster them.
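For reference, the training step in (2) can be written out under the assumption that the standard skip-gram formulation with negative sampling is used. The output vectors $v_{sg}$, the rooted subgraph vocabulary $\mathcal{V}$, the number of negative samples $K$, and the noise distribution $P_{neg}$ are symbols introduced here for illustration and do not appear in the source.

```latex
% Softmax form of the conditional probability of a rooted subgraph given its graph
\Pr\bigl(sg_n^{(d)} \mid \Phi(G)\bigr)
  = \frac{\exp\bigl(\Phi(G)^{\top} v_{sg_n^{(d)}}\bigr)}
         {\sum_{sg \in \mathcal{V}} \exp\bigl(\Phi(G)^{\top} v_{sg}\bigr)}

% Negative-sampling surrogate maximized per (graph, rooted subgraph) pair
\log \sigma\bigl(\Phi(G)^{\top} v_{sg_n^{(d)}}\bigr)
  + \sum_{k=1}^{K} \mathbb{E}_{sg_k \sim P_{neg}}
    \Bigl[\log \sigma\bigl(-\Phi(G)^{\top} v_{sg_k}\bigr)\Bigr],
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```

The surrogate avoids the sum over the whole vocabulary in the softmax denominator: each stochastic gradient step touches only one positive rooted subgraph and $K$ sampled negatives.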
2. Optimizing model data input
The invention predicts the output label of each NFA graph sample. In the NFA graph set generated from the regular expressions to be processed, an NFA graph has not only node and edge structure; the transition information on its edges carries much of the state explosion information. Processing of edge information is therefore added on top of the graph2vec model. The design idea is to analyze the edge character set of each node and divide the nodes into three categories, used as node feature information: nodes that accelerate state explosion, nodes that hinder state explosion, and ordinary nodes.
(1) Nodes that accelerate state explosion: if the edge character set of node n overlaps with the prefix of the regular expression and the character set appears k times consecutively (k > 1), node n makes the number of DFA states corresponding to the regular expression grow rapidly, and node n is marked as a node that accelerates state explosion.
(2) Nodes that hinder state explosion: if a node n that accelerates state explosion exists, a node m exists before node n, and the intersection of the edge character sets of node m and node n is the empty set, then node m makes the number of DFA states corresponding to the regular expression grow slowly, and node m is marked as a node that hinders state explosion.
(3) Ordinary nodes: all nodes other than the two types above are marked as ordinary nodes.
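The three rules above leave several details open, so the following Python is only one possible reading, applied to a linear chain of states. The assumptions are: each node has a single outgoing character set, only the first node of a repeated run is marked as accelerating, the prefix is passed in as a set of characters, and a character class such as [^A] is modeled over uppercase letters only. The names `classify_nodes`, `prefix_chars`, and `min_repeat` are hypothetical.

```python
import string

NORMAL, ACCELERATE, HINDER = 0, 1, 2

def classify_nodes(out_sets, prefix_chars, min_repeat=2):
    """Label the nodes of a linear NFA chain.

    out_sets[i] is the character set on the outgoing edge of node i
    (empty for the final node); prefix_chars holds the prefix characters.
    """
    labels = [NORMAL] * len(out_sets)
    i = 0
    while i < len(out_sets):
        # Find the run of consecutive nodes sharing the same outgoing set.
        j = i
        while j + 1 < len(out_sets) and out_sets[j + 1] == out_sets[i]:
            j += 1
        if j - i + 1 >= min_repeat and out_sets[i] & prefix_chars:
            labels[i] = ACCELERATE
            # Earlier nodes whose outgoing set is disjoint from the run's set.
            for m in range(i):
                if labels[m] == NORMAL and not (out_sets[m] & out_sets[i]):
                    labels[m] = HINDER
        i = j + 1
    return labels

ALPHABET = set(string.ascii_uppercase)
NOT_A = ALPHABET - {"A"}
# Chain for TRA[^A]{3}HL: nodes 0..8, node 8 is final.
out_sets = [{"T"}, {"R"}, {"A"}, NOT_A, NOT_A, NOT_A, {"H"}, {"L"}, set()]
labels = classify_nodes(out_sets, prefix_chars={"T"})
print(labels)  # [0, 0, 2, 1, 0, 0, 0, 0, 0]
```

Under these assumptions the expression TRA[^A]{3}HL yields label 1 for node 3 and label 2 for node 2, with all other nodes ordinary.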
After the nodes of the NFA are classified into these three types, the classification feature information of the nodes can be added to the WL algorithm, so that the extracted rooted subgraphs contain not only the structural information of the graph but also the information on its edges. The graph2vec model is then trained as before to obtain the embedded representation of the NFA.
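A hedged sketch of WL relabeling seeded with such node classes follows. It aggregates over out-neighbors and emits one token per node and round in the form "round+label". The function name `wl_rooted_subgraphs`, the integer label compression, and the nine-node chain with classes [0, 0, 2, 1, 0, 0, 0, 0, 0] are illustrative assumptions; the concrete label values will not necessarily match those of the patent's own tables.

```python
def wl_rooted_subgraphs(out_neighbors, init_labels, depth):
    """Return, per node, its rooted subgraph tokens for rounds 0..depth.

    out_neighbors[v] lists the successors of node v;
    init_labels[v] is the round-0 label (the node class).
    """
    current = {v: str(init_labels[v]) for v in out_neighbors}
    tokens = {v: [f"0+{current[v]}"] for v in out_neighbors}
    for rnd in range(1, depth + 1):
        # Signature: own previous label plus sorted previous labels of successors.
        signatures = {
            v: (current[v], tuple(sorted(current[u] for u in out_neighbors[v])))
            for v in out_neighbors
        }
        # Map each distinct signature to a compact new label.
        table = {sig: str(idx)
                 for idx, sig in enumerate(sorted(set(signatures.values())))}
        current = {v: table[signatures[v]] for v in out_neighbors}
        for v in out_neighbors:
            tokens[v].append(f"{rnd}+{current[v]}")
    return tokens

# Nine-node chain 0 -> 1 -> ... -> 8 with node classes as round-0 labels.
out_neighbors = {v: [v + 1] for v in range(8)}
out_neighbors[8] = []
init_labels = dict(enumerate([0, 0, 2, 1, 0, 0, 0, 0, 0]))
tokens = wl_rooted_subgraphs(out_neighbors, init_labels, depth=3)
for v in sorted(tokens):
    print(v, tokens[v])
```

The last token of each node identifies its degree-3 rooted subgraph. Nodes 4 and 5 receive identical tokens in every round because their 3-hop out-neighborhoods look the same, whereas node 6 differs in round 3 because it sees the end of the chain.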
Compared with the prior art, the invention has the following positive effects:
Note that an NFA is a graph composed of edges and nodes, and certain special structures cause state explosion in the generated DFA. When classifying regular expressions, we therefore convert them into graph structures, solve the graph task end to end with a graph neural network (GNN), and obtain a low-dimensional embedded representation of each NFA graph, so that graphs with similar structures are also close in the vector space. This lets us classify the expressions by the features of their graph structure.
In implementing the model, node features are added according to edge information, and a complete NFA graph is decomposed into a number of subgraphs, which further ensures that graphs with similar substructures receive the same classification. These two key points are described below.
Adding edge information:
The NFA graph corresponding to each regular expression is generated by the Thompson construction. The information on the edges of the graph determines whether the generated DFA undergoes state explosion: even if two NFA graphs are isomorphic, the two NFAs may have different labels if the transition information on corresponding edges differs. The edge information must therefore be added to the model, and the way it is added matters, because it must fully reflect the characteristics of state explosion type regular expressions. Based on a survey of the relevant literature and experimental verification, the model divides nodes into three classes by analyzing their edge information and adds the classes to the WL algorithm as node feature information, so that the extracted rooted subgraphs contain the edge-induced state explosion features.
Extracting rooted subgraphs:
Note that if two NFA graphs share more identical subgraphs (including the information on edges), they have the same label with high probability, and experiments verify that the NFAs corresponding to most state explosion type regular expressions have similar substructures. During model training, the WL algorithm is therefore used to extract the rooted subgraphs of all graphs, and each graph is regarded as a set of rooted subgraphs. The resulting graph embedding carries more information about the subgraphs the graph contains, so NFAs with similar subgraphs are also closer in the embedding space.
Specifically, for a regular expression (from a rule set such as snort or l7), the process of determining whether it produces a state explosion DFA is as follows: first, the NFA graph corresponding to the regular expression is generated by the Thompson construction and its information is stored in a gexf file; all rooted subgraphs of the gexf graph are extracted using the WL algorithm fused with the node classification; the graph2vec model is then trained to obtain the embedded representation of the graph; finally, an SVM classifier processes the embedded representation to obtain the final label: 1 (explosive) or -1 (non-explosive).
The hardware configuration of the experiment of the invention is shown in table 2:
Table 2. Hardware configuration of the experiment
Operating system | CentOS Linux release 7.5.1804 |
Memory | 125G |
CPU | Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz |
Experiment design: 3520 regular expressions are selected from the rule sets clav_regex_strings_2015, l7_rules_2015 and snort_regex_rules_2015 and converted into the corresponding NFA graphs by the Thompson construction, and the model is trained to obtain the embedded representation of each graph. Because the SVM used for binary classification is a supervised model, each regular expression is labeled by the threshold method; of the 3520 regular expressions, 1570 are non-explosive (labeled -1) and 1720 are explosive (labeled 1). During training, 70% (3168) of the NFA embedded representations are randomly selected as the training set and the remaining 30% (352) as the test set. d is set to 3, 5 and 8 respectively, and the experimental results are shown in Table 3:
TABLE 3 comparison of the results
The experimental results show that the highest classification accuracy, 98%, is achieved when d is set to 5, i.e., when 5-hop neighbor information is aggregated.
Drawings
FIG. 1 is a comparison of the idea of the graph2vec model with that of the doc2vec model;
(a) the doc2vec model idea, (b) the graph2vec model idea;
FIG. 2 is a flow chart of the method of the present invention;
FIG. 3 is an example of SVM binary classification.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The original input to the present invention is a set of regular expressions, such as the regular expression "TRA[^A]{3}HL" shown in Table 4. The process of predicting whether a regular expression is of the state explosion type, as shown in FIG. 2, is divided into the following stages:
(1) First, the NFA graph corresponding to each regular expression is generated by the Thompson construction; Table 4 shows the NFA graph corresponding to the regular expression "TRA[^A]{3}HL".
Table 4. Example of executing the WL algorithm
(2) Converting the NFA into standard graph-format data: after the NFA graph of each regular expression is generated by the Thompson construction, a gexf file is used to record the NFA graph information. For example, the gexf file storing the NFA graph of the regular expression "TRA[^A]{3}HL" in Table 4 contains the node information and edge information of the NFA graph, where the node information includes the id value of each node, and the edge information includes the start node (source) and end node (target) of each edge and the transition character on the edge (the edge character, stored as weight and represented by its ASCII code value).
(3) Extracting all rooted subgraphs of the NFA: to illustrate the extraction of rooted subgraphs after edge information is added, d is set to 3, i.e., nodes within a 3-hop neighborhood are aggregated. Taking the NFA graph in Table 4 as an example, the WL algorithm extracts the 3-degree rooted subgraph of each node in the graph. In "a+b", a denotes the a-th iteration round, which aggregates the information in the a-hop neighborhood, and b denotes the rooted subgraph label obtained by the aggregation. Round 0 records the feature information of the node, i.e., the node classification result derived from its edge information: "1" denotes a node that accelerates state explosion, "2" a node that hinders state explosion, and "0" an ordinary node. For example, node 3 is labeled "1" because its outgoing character set is [^A] (all characters except A), which then occurs 3 times consecutively and shares the character T with the prefix T, so the node accelerates state explosion. Node 2 is labeled "2" because the intersection of its outgoing-edge character A with the following character set [^A] is empty, which hinders state explosion. In the parentheses, the first number is the node's rooted subgraph label from the previous round, followed by the previous-round rooted subgraph labels of all its out-neighbors; the combined sequence is mapped to a new label, which is the rooted subgraph label for the current round. Finally the 3-degree rooted subgraphs of all nodes are obtained, so the NFA in Table 4 is composed of nine 3-degree rooted subgraphs.
(4) Obtaining the embedded representation of the NFA graph: after the WL algorithm is executed on each NFA graph, the set of all rooted subgraphs of that NFA graph is obtained, and each NFA graph is regarded as composed of its set of rooted subgraphs. The correspondence between rooted subgraphs and NFA graphs (i.e., which NFA graph each rooted subgraph belongs to) is input into the skip-gram model, and training yields the embedded representation of each NFA graph, so that high-dimensional graph-structured data is represented by a low-dimensional vector of dimension 1 x 1024 (the dimension is set manually).
(5) Classification and prediction: after the NFA graphs are embedded into the low-dimensional vector space, the vectorized NFAs (i.e., the embedded representations obtained in the previous step) can be classified with a classification model, such as the SVM (support vector machine) binary classifier adopted in the present invention. The embedded representation of an NFA graph is input directly into the classification model to obtain the category of the corresponding regular expression (i.e., whether it is a state explosion type regular expression). The basic idea of the SVM is to find the separating hyperplane that correctly divides the training data set with the largest geometric margin; the points in the two half-spaces separated by the hyperplane correspond to the two classification results, as shown in FIG. 3.
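The final classification stage can be illustrated with a small linear SVM trained by sub-gradient descent on the regularized hinge loss. This is a stand-in sketch, not the classifier configuration used in the experiments: the source does not state the kernel, solver, or hyperparameters, and the 2-dimensional toy vectors below merely take the place of the 1024-dimensional NFA embeddings. The names `train_linear_svm` and `predict` and the values of `lam`, `lr`, and `epochs` are assumptions.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b by per-sample sub-gradient steps on the L2-regularized hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Regularization step: shrink the weights slightly.
            w = [wj * (1 - lr * lam) for wj in w]
            if margin < 1:
                # The sample violates the margin: move the hyperplane toward it.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    """Return 1 (explosive) or -1 (non-explosive)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy embeddings: label 1 = state explosion type, -1 = non-explosive.
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])  # [1, 1, -1, -1]
```

In practice a library SVM with a tuned kernel would normally replace this loop; the sketch only makes the margin idea in the paragraph above concrete.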
The foregoing is merely a preferred embodiment of the present invention, and it should be understood that various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention. The present invention should not be limited to the disclosure of the embodiments and drawings in the specification, and the scope of the present invention is defined by the scope of the claims.
Claims (9)
1. A method for identifying a state explosion type regular expression, comprising the following steps:
1) generating an NFA (nondeterministic finite automaton) graph corresponding to a regular expression to be identified, to obtain an NFA graph set corresponding to the regular expression;
2) for each NFA graph in the NFA graph set, extracting all root subgraphs of the NFA graph, inputting the root subgraphs into a graph2vec model, and training to obtain an embedded representation of the NFA graph;
3) processing the embedded representation of the NFA graph with a classification model to determine whether the regular expression is a state explosion type regular expression.
2. The method according to claim 1, wherein in step 1) the NFA graph corresponding to the regular expression is generated by the Thompson construction method, and the information of the NFA graph is stored in a gexf file; the information of the NFA graph includes node information and edge information, wherein the node information includes the id value of each node, and the edge information includes the start node, the end node, and the transition character of each edge.
3. The method according to claim 2, wherein, as the characteristic information of the nodes, the nodes are classified into three categories according to their edge character sets: nodes that accelerate state explosion, nodes that block state explosion, and common nodes; wherein a node that accelerates state explosion is defined as follows: if the edge character set of a node n overlaps with the prefix of the regular expression and that edge character set appears k times consecutively, node n is marked as a node that accelerates state explosion; a node that blocks state explosion is defined as follows: if a node n that accelerates state explosion exists, a node m precedes node n, and the intersection of the outgoing-edge character set of node m and the outgoing-edge character set of node n is empty, node m is marked as a node that blocks state explosion; all nodes other than the accelerating nodes and the blocking nodes are marked as common nodes.
4. The method of claim 3, wherein the characteristic information of the nodes is incorporated into the WL algorithm, and all root subgraphs of the NFA graph stored in the gexf file are then extracted using the WL algorithm fused with the node classification.
5. The method of claim 1, wherein the classification model is an SVM classifier.
6. The method of claim 1, wherein the graph2vec model is trained as follows: all root subgraphs of an NFA graph are treated as the vocabulary of a document, and a vectorized representation of the document is obtained by maximizing $\sum_{j} \log \Pr(w_j \mid d_i)$; the optimization objective is to maximize the probability that each graph's own root subgraphs appear in it, i.e., to maximize $\log \Pr(sg_n^{(d)} \mid \Phi(G))$; wherein $d_i$ is the vectorized representation of the i-th document, $w_j$ is the embedded representation of the j-th word of the vocabulary appearing in the i-th document, $\Pr(w_j \mid d_i)$ is the conditional probability that word $w_j$ occurs in document $d_i$, $\Phi(G)$ is the vectorized representation of the graph G for which an embedded representation is desired, and $sg_n^{(d)}$ is the vectorized representation of the degree-d root subgraph rooted at node n in graph G.
7. A system for identifying a state explosion type regular expression, comprising an NFA graph conversion module, a root subgraph extraction module, an NFA graph embedded representation generation module, and a classification model; wherein
the NFA graph conversion module is used for generating an NFA graph corresponding to the regular expression to obtain an NFA graph set corresponding to the regular expression;
the root subgraph extraction module is used for extracting all root subgraphs in the NFA graph;
the NFA graph embedded representation generation module is used for training on all root subgraphs in the NFA graph with a graph2vec model to obtain the embedded representation of the NFA graph;
the classification model is used for determining, according to the embedded representation of the NFA graph corresponding to the regular expression, whether the regular expression is a state explosion type regular expression.
8. The system of claim 7, wherein the NFA graph conversion module generates the NFA graph corresponding to the regular expression by the Thompson construction method and stores the information of the NFA graph in a gexf file; the information of the NFA graph includes node information and edge information, wherein the node information includes the id value of each node, and the edge information includes the start node, the end node, and the transition character of each edge.
9. The system of claim 8, wherein, as the characteristic information of the nodes, the nodes are classified into three categories according to their edge character sets: nodes that accelerate state explosion, nodes that block state explosion, and common nodes; wherein a node that accelerates state explosion is defined as follows: if the edge character set of a node n overlaps with the prefix of the regular expression and that edge character set appears k times consecutively, node n is marked as a node that accelerates state explosion; a node that blocks state explosion is defined as follows: if a node n that accelerates state explosion exists, a node m precedes node n, and the intersection of the outgoing-edge character set of node m and the outgoing-edge character set of node n is empty, node m is marked as a node that blocks state explosion; all nodes other than the accelerating nodes and the blocking nodes are marked as common nodes.
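As an illustration of the root subgraph extraction recited in the claims above, the following is a minimal sketch of WL relabeling on a toy NFA graph. It assumes the graph is given as a list of (start node, end node, transition character) edges, mirroring the edge information stored in the gexf file, and that each node already carries a class label. The labels "A" (accelerating), "B" (blocking) and "C" (common), the string format of the subgraph labels, and the names `wl_root_subgraphs` and `graph_document` are all hypothetical choices for this sketch, not details from the patent.

```python
from collections import defaultdict


def wl_root_subgraphs(edges, node_class, max_degree=2):
    """Return {node: [label_0, ..., label_max_degree]} via WL relabeling.

    edges: list of (source, target, transition_char) of a directed NFA graph.
    node_class: {node: class string}; used as the degree-0 label so that the
    node classification is fused into the WL procedure.
    The label at index d stands for the degree-d root subgraph of the node.
    """
    out = defaultdict(list)
    for src, dst, ch in edges:
        out[src].append((ch, dst))

    labels = dict(node_class)
    history = {n: [labels[n]] for n in node_class}
    for _ in range(max_degree):
        new_labels = {}
        for n in node_class:
            # Sorted multiset of (transition char, neighbour label) pairs,
            # so the result does not depend on edge order.
            neigh = sorted((ch, labels[m]) for ch, m in out[n])
            body = ",".join(ch + ":" + lab for ch, lab in neigh)
            new_labels[n] = labels[n] + "(" + body + ")"
            history[n].append(new_labels[n])
        labels = new_labels
    return history


def graph_document(history):
    """Flatten all root subgraph labels of one graph into one 'document'."""
    return [lab for n in sorted(history) for lab in history[n]]


# Toy NFA graph: 0 -a-> 1, 1 -a-> 1 (self loop), 1 -b-> 2.
edges = [(0, 1, "a"), (1, 1, "a"), (1, 2, "b")]
classes = {0: "C", 1: "A", 2: "C"}
document = graph_document(wl_root_subgraphs(edges, classes))
print(document)
```

Each graph's `document` would then play the role of the vocabulary of root subgraphs that is fed, together with the graph identity, to the skip-gram style training of the embedding.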
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110784458.0A CN113627164B (en) | 2021-07-12 | 2021-07-12 | Method and system for identifying state explosion type regular expression |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113627164A (en) | 2021-11-09 |
CN113627164B (en) | 2024-03-01 |
Family
ID=78379500
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110784458.0A Active CN113627164B (en) | 2021-07-12 | 2021-07-12 | Method and system for identifying state explosion type regular expression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113627164B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101827084A (en) * | 2009-01-28 | 2010-09-08 | Juniper Networks, Inc. | Efficient application identification in network devices |
CN103259793A (en) * | 2013-05-02 | 2013-08-21 | Northeastern University | Method for inspecting deep packets based on suffix automaton regular engine structure |
US20150355891A1 (en) * | 2014-06-10 | 2015-12-10 | International Business Machines Corporation | Computer-based distribution of large sets of regular expressions to a fixed number of state machine engines for products and services |
US20160275205A1 (en) * | 2013-12-05 | 2016-09-22 | Hewlett Packard Enterprise Development Lp | Regular expression matching |
CN109800337A (en) * | 2018-12-06 | 2019-05-24 | Chengdu Wang'an Technology Development Co., Ltd. | Multi-pattern regular expression matching algorithm suitable for large alphabets |
Non-Patent Citations (1)
Title |
---|
Wang Xiang; Lu Yuhai; Ma Wei; Liu Yanbing: "A regular expression matching method for DFA state explosion", Computer Engineering, vol. 45, no. 04, pages 148 - 156 * |
Also Published As
Publication number | Publication date |
---|---|
CN113627164B (en) | 2024-03-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||