CN113627164B - Method and system for identifying state explosion type regular expression - Google Patents

Method and system for identifying state explosion type regular expression

Info

Publication number
CN113627164B
CN113627164B (application CN202110784458.0A)
Authority
CN
China
Prior art keywords
node
graph
nfa
regular expression
state explosion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110784458.0A
Other languages
Chinese (zh)
Other versions
CN113627164A (en)
Inventor
卢毓海
王晓琳
张春燕
刘燕兵
谭建龙
郭莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202110784458.0A priority Critical patent/CN113627164B/en
Publication of CN113627164A publication Critical patent/CN113627164A/en
Application granted granted Critical
Publication of CN113627164B publication Critical patent/CN113627164B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method and a system for identifying state explosion type regular expressions. The method comprises the following steps: 1) for a regular expression to be identified, generating the NFA graph corresponding to the regular expression, to obtain an NFA graph set corresponding to the regular expression; 2) for each NFA graph in the NFA graph set, extracting all root subgraphs in the NFA graph, inputting the root subgraphs into a graph2vec model, and training to obtain an embedded representation of the NFA graph; 3) processing the embedded representation of the NFA graph with a classification model to determine whether the regular expression is a state explosion type regular expression. The method can process regular expressions in batches efficiently and rapidly, and meets an online system's requirements for high processing performance and low space consumption.

Description

Method and system for identifying state explosion type regular expression
Technical Field
The invention relates to a method and a system for identifying a state explosion type regular expression, and belongs to the technical field of computer software.
Background
Regular expression matching is a key component of many applications such as network filtering, for example Deep Packet Inspection (DPI), which can enhance the security of network communications and detect malicious traffic. To perform regular expression matching, the regular expression is first converted into a finite automaton (FA). A finite automaton is a state machine that recognizes the same language as the regular expression; depending on whether the next state transition is uniquely determined, it is classified as a non-deterministic finite automaton (NFA) or a deterministic finite automaton (DFA). NFA and DFA have the same expressive power, but since each DFA state corresponds to a set of NFA states, the conversion from NFA to DFA may lead to a sharp increase in the number of states, a phenomenon known as state explosion. Table 1 compares the space complexity and matching time complexity of NFA and DFA under different compilation strategies.
Table 1 Comparison of the space complexity and matching time complexity of NFA and DFA under different compilation strategies
DFA is widely used in DPI applications because of its efficient matching performance, but the state explosion of DFA poses a significant challenge to its practical use. The existing technique for identifying whether a regular expression will produce a state-exploding DFA simply sets a threshold on the DFA state count: if the threshold is exceeded, the regular expression is judged to be of the state explosion type, otherwise it is not. Specifically, a threshold is set in the process of generating the DFA from the regular expression: the regular expression is first parsed into a parse tree, the parse tree is converted into an NFA with the Thompson or Glushkov construction, and the NFA is finally converted into a DFA with the subset construction; if the number of generated DFA states exceeds the threshold, the DFA is judged to have a state explosion and the corresponding regular expression is of the explosion type, otherwise it is of the non-explosion type.
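For concreteness, the prior-art threshold check can be sketched as follows. This is a minimal sketch, assuming an NFA encoded as a dictionary of transitions (with None for epsilon moves); the function names and the threshold value are illustrative assumptions, not the exact procedure of any existing tool.

```python
# Sketch of the threshold-based prior art: run the subset construction on an
# NFA and flag the regular expression as "state explosion type" once the
# number of DFA states exceeds a threshold. The NFA encoding is assumed:
# node -> {character or None (epsilon): set of successor nodes}.
from collections import deque

def epsilon_closure(states, nfa):
    """All NFA states reachable from `states` via epsilon (None) transitions."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in nfa.get(s, {}).get(None, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def dfa_state_count(nfa, start, alphabet, threshold=10000):
    """Subset construction; returns (state_count, exploded)."""
    start_set = epsilon_closure({start}, nfa)
    seen, queue = {start_set}, deque([start_set])
    while queue:
        current = queue.popleft()
        for ch in alphabet:
            move = set()
            for s in current:
                move.update(nfa.get(s, {}).get(ch, ()))
            if not move:
                continue
            nxt = epsilon_closure(move, nfa)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
                if len(seen) > threshold:
                    return len(seen), True   # judged to be state-explosive
    return len(seen), False
```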
The existing technical scheme mainly identifies state explosion type regular expressions by setting a threshold, and this technique has the following defects:
1. Low practicability: to judge whether a regular expression is of the state explosion type, the complete process of generating the DFA from the regular expression must be executed, which is operationally complex; the algorithm is slow and not easy to carry out.
2. High space complexity: if the regular expression is of the state explosion type, a large number of DFA states are generated when converting the NFA into a DFA, and a large amount of buffer space is required to record these states, so the space cost is extremely high.
3. The specific structures that cause state explosion cannot be identified: some regular expressions have DFA state counts that do not reach the threshold, so the technique does not flag them, yet they contain specific state-explosion structures and should be classified as state-explosive.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention provides a method for identifying state explosion type regular expressions that can process regular expressions in batches efficiently and rapidly and meets an online system's requirements for high processing performance and low space consumption. The main idea of the invention is to automatically learn the structural characteristics of the NFA graph (generated from the regular expression by the Thompson construction) with a graph neural network (GNN) model, embed the high-dimensional graph representation into a low-dimensional vector space, and then perform binary classification on the vectorized graph representations with a classification model: whether the regular expression corresponding to the NFA is of the state explosion type. The details are described in the technical scheme below.
The technical scheme of the invention is as follows:
A method for identifying a state explosion type regular expression comprises the following steps:
1) for a regular expression to be identified, generating the NFA graph corresponding to the regular expression, to obtain an NFA graph set corresponding to the regular expression;
2) for each NFA graph in the NFA graph set, extracting all root subgraphs in the NFA graph, inputting the root subgraphs into a graph2vec model, and training to obtain an embedded representation of the NFA graph;
3) processing the embedded representation of the NFA graph with a classification model to judge whether the regular expression is a state explosion type regular expression.
Further, in step 1), the NFA graph corresponding to the regular expression is generated by the Thompson construction, and the information of the NFA graph is stored in a gexf file. The information of the NFA graph includes node information and edge information, where the node information includes the id value of each node, and the edge information includes the start node, the end node, and the transition character on the edge for each edge.
Further, according to its out-edge character set, each node is classified into one of three types, used as the node's feature information: nodes that accelerate state explosion, nodes that block state explosion, and common nodes. A node that accelerates state explosion: if the out-edge character set of node n overlaps the prefix of the regular expression and this character set appears k times in succession, node n is marked as a node that accelerates state explosion. A node that blocks state explosion: if there is a node n that accelerates state explosion and a node m before node n whose out-edge character set has an empty intersection with the out-edge character set of node n, node m is marked as a node that blocks state explosion. All remaining nodes, other than those that accelerate or block state explosion, are marked as common nodes.
Further, the feature information of the nodes is added into the WL algorithm, and all root subgraphs in the gexf file are then extracted with this WL algorithm fused with node classification.
Further, the classification model is an SVM classifier.
Further, the graph2vec model is trained as follows: all root subgraphs in the NFA graphs are treated as the vocabulary of a document; the vectorized representation of the document is obtained by maximizing $\sum_{j}\log \Pr(w_j \mid d_i)$, and the optimization objective is to maximize the probability of each graph's own root subgraphs appearing in it, i.e. to maximize $\sum_{n}\log \Pr\big(sg_n^{(d)} \mid \Phi(G)\big)$, where $d_i$ denotes the vectorized representation of the i-th document, $w_j$ denotes the embedded representation of the j-th word of the vocabulary appearing in the i-th document, $\Pr(w_j \mid d_i)$ denotes the conditional probability of the word $w_j$ occurring in document $d_i$, $\Phi(G)$ is the vectorized representation of the graph G whose embedded representation is sought, and $sg_n^{(d)}$ denotes the vectorized representation of the d-degree root subgraph rooted at node n in graph G.
A recognition system for state explosion type regular expressions comprises an NFA graph conversion module, a root subgraph extraction module, an NFA graph embedded-representation generation module, and a classification model, wherein:
the NFA graph conversion module is used to generate the NFA graph corresponding to the regular expression, obtaining the NFA graph set corresponding to the regular expression;
the root subgraph extraction module is used to extract all root subgraphs in each NFA graph;
the NFA graph embedded-representation generation module is used to train on all root subgraphs of the NFA graph with a graph2vec model, obtaining the embedded representation of the NFA graph;
the classification model is used to judge, from the embedded representation of the NFA graph of the regular expression, whether the regular expression is a state explosion type regular expression.
In the invention, the graph2vec model is selected as the basic model and, combined with the characteristics of NFA graphs, the edge information of the NFA is added to optimize the model input, achieving more accurate classification. The key technical scheme of the invention is described as follows:
1. Introduction to the graph2vec model
The selected basic model is graph2vec, a graph neural network framework for unsupervised representation learning. It learns embedded representations of undirected, directed, and weighted graphs, which can serve as input data for subsequent tasks such as graph classification. The specific design of the graph2vec model is as follows:
(1) Extracting root subgraphs: first, the WL algorithm proposed by Shervashidze et al. (Algorithm 1) is used to extract the d-degree root subgraph of each node in the NFA graphs and assign a unique label to every root subgraph, so that each NFA graph can be regarded as composed of a set of root subgraphs; if each graph is regarded as a document, all the root subgraphs in the graph correspond to the words that make up the document, as shown in FIG. 1.
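The WL relabeling that produces the d-degree rooted subgraph labels can be sketched along the following lines; this is a minimal sketch, assuming the graph is given as an adjacency dictionary of out-neighbors, and the label-compression bookkeeping is an illustrative simplification of the algorithm of Shervashidze et al.

```python
# Sketch of WL relabeling: each round aggregates a node's label with the sorted
# labels of its out-neighbors and compresses the result to a new label, so the
# label after round d identifies the d-degree rooted subgraph of that node.
def wl_root_subgraphs(adj, init_labels, d):
    """adj: node -> list of out-neighbors; init_labels: node -> initial label.
    Returns node -> list of rooted-subgraph labels for rounds 0..d."""
    labels = dict(init_labels)
    history = {n: [labels[n]] for n in adj}
    compress = {}                           # signature -> compressed label id
    for _ in range(d):
        new_labels = {}
        for n in adj:
            signature = (labels[n], tuple(sorted(labels[m] for m in adj[n])))
            if signature not in compress:
                compress[signature] = len(compress)
            new_labels[n] = compress[signature]
        labels = new_labels
        for n in adj:
            history[n].append(labels[n])
    return history
```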
(2) Training: the root subgraphs of all graphs obtained by the WL algorithm can be regarded as the vocabulary of a document, and the idea of the doc2vec model is used: the training goal of the model is to maximize the conditional probability of words that appear in a document and minimize the conditional probability of words that do not appear in it, i.e. the vectorized representation of each document is obtained by maximizing equation (1), $\sum_{j}\log \Pr(w_j \mid d_i)$; the more identical words two documents share, the closer their representations in the low-dimensional vector space should be. Here $d_i$ denotes the vectorized representation of the i-th document, $w_j$ denotes the embedded representation of the j-th word of the vocabulary appearing in the i-th document, and $\Pr(w_j \mid d_i)$ denotes the conditional probability of the word $w_j$ occurring in document $d_i$.
If two graphs consist of more similar root subgraphs, the graph2vec model likewise expects their low-dimensional embedded representations to be closer, so its optimization objective is to maximize the probability of each graph's own root subgraphs appearing in it, as in equation (2), $\max_{\Phi}\sum_{n}\log \Pr\big(sg_n^{(d)} \mid \Phi(G)\big)$, where $\Phi(G)$ is the vectorized representation of the graph G (the NFA graph generated from a regular expression) whose embedded representation is sought, and $sg_n^{(d)}$ denotes the vectorized representation of the d-degree root subgraph rooted at node n in graph G.
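For completeness, the conditional probabilities in equations (1) and (2) are usually taken to be the softmax used by doc2vec and skip-gram; the text does not write this expansion out, so the following form is an assumption based on those standard models, with V the vocabulary of words and S the set of all root subgraph labels:

```latex
% Assumed doc2vec / skip-gram softmax for the conditional probabilities:
\Pr(w_j \mid d_i) = \frac{\exp(\vec d_i \cdot \vec w_j)}{\sum_{w \in V} \exp(\vec d_i \cdot \vec w)},
\qquad
\Pr\!\left(sg_n^{(d)} \mid \Phi(G)\right) =
  \frac{\exp\!\left(\Phi(G) \cdot sg_n^{(d)}\right)}{\sum_{sg \in S} \exp\!\left(\Phi(G) \cdot sg\right)}
```

In practice the denominators are approximated with negative sampling, as noted below.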
In the model training process, the idea of the skip-gram model is used when maximizing equation (2), and the final graph embedded representation is obtained with a stochastic gradient descent optimization algorithm and the negative sampling technique.
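A minimal training sketch following this description treats each NFA graph as a document of rooted-subgraph labels and trains gensim's Doc2Vec in PV-DBOW (skip-gram style) mode with negative sampling. Using gensim, and the specific parameter values, are assumptions for illustration; the text only specifies the skip-gram idea, stochastic gradient descent and negative sampling.

```python
# Sketch: graph2vec-style embedding by treating rooted-subgraph labels as words.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_graph_embeddings(graph_subgraphs, dim=1024, epochs=50):
    """graph_subgraphs: dict graph_id -> list of rooted-subgraph labels (strings)."""
    corpus = [TaggedDocument(words=subs, tags=[str(gid)])
              for gid, subs in graph_subgraphs.items()]
    model = Doc2Vec(corpus, vector_size=dim, dm=0,   # dm=0: PV-DBOW (skip-gram style)
                    negative=5, min_count=1, epochs=epochs)
    return {gid: model.dv[str(gid)] for gid in graph_subgraphs}
```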
(3) Use: after the embedded representation of each graph is obtained, it can be used for subsequent classification or clustering tasks; for example, the graphs can be binary-classified with an SVM, or clustered with the K-means clustering method.
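Both downstream uses can be sketched with scikit-learn as follows; the kernel and other settings are assumptions, and the labels follow the +1 (explosive) / -1 (non-explosive) convention used later in the description.

```python
# Sketch: binary classification of graph embeddings with an SVM, or clustering
# them with K-means, using scikit-learn.
from sklearn.svm import SVC
from sklearn.cluster import KMeans

def classify_embeddings(train_X, train_y, test_X):
    clf = SVC(kernel="rbf")              # kernel choice is an assumption
    clf.fit(train_X, train_y)            # train_y values in {+1, -1}
    return clf.predict(test_X)

def cluster_embeddings(X, k=2):
    return KMeans(n_clusters=k).fit_predict(X)
```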
2. Optimizing model data input
The invention predicts the output label of NFA graph samples, namely of the NFA graph set generated from the regular expression set to be processed. Note that an NFA graph has not only node and edge structure but also transition information on its edges, which carries most of the state-explosion information, so processing of edge information is added on top of the graph2vec model. The specific design is to classify each node into one of three types by analyzing its out-edge character set and to use the type as the node's feature information: nodes that accelerate state explosion, nodes that block state explosion, and common nodes.
(1) Nodes that accelerate state explosion: if the out-edge character set of node n overlaps the prefix of the regular expression and the character set appears k times in succession (k > 1), node n causes the DFA state count corresponding to the regular expression to grow sharply, and node n is marked as a node that accelerates state explosion.
(2) Nodes that block state explosion: if there is a node n that accelerates state explosion as above, and a node m before node n whose out-edge character set has an empty intersection with that of node n, such a node m causes the DFA state count corresponding to the regular expression to grow slowly, and node m is marked as a node that blocks state explosion.
(3) Common nodes: all nodes except the two types above are marked as common nodes.
After the nodes are classified into these three types, the node-classification feature information can be added into the WL algorithm, so that the extracted root subgraphs contain not only the structural information of the graph but also the edge information; the graph2vec model is then trained as before to obtain the embedded representation of the NFA graph.
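A minimal sketch of this three-way node labeling is given below; the graph encoding and the way the k consecutive repetitions are detected are illustrative assumptions, since the text states the rules only at the level given above.

```python
# Sketch of edge-driven node features: 1 = accelerates state explosion,
# 2 = blocks state explosion, 0 = common node.
def label_nodes(out_charsets, predecessors, run_length, prefix_chars, k=2):
    """out_charsets: node -> set of out-edge characters;
    predecessors: node -> list of nodes immediately before it in the NFA;
    run_length: node -> number of consecutive repetitions of its out-edge set;
    prefix_chars: set of characters in the regular expression's prefix."""
    labels = {n: 0 for n in out_charsets}                    # default: common node
    for n, chars in out_charsets.items():
        # overlaps the prefix and repeats k (or more) times in succession
        if (chars & prefix_chars) and run_length.get(n, 1) >= k:
            labels[n] = 1                                    # accelerates explosion
    for n, label in list(labels.items()):
        if label != 1:
            continue
        for m in predecessors.get(n, []):
            if not (out_charsets[m] & out_charsets[n]):
                labels[m] = 2                                # blocks explosion
    return labels
```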
Compared with the prior art, the invention has the following positive effects:
Note that an NFA is a graph consisting of edges and nodes, and certain special structures cause the states of the generated DFA to explode. Therefore, to binary-classify regular expressions, we convert them into graph structures, use a graph neural network (GNN) to handle the graph-related task end to end, and extract features into an embedded representation, obtaining a low-dimensional embedded representation of each NFA graph in which graphs with similar structures also have similar representations in the vector space; this is what allows the regular expressions to be classified by the features of their graph structure.
In implementing the model, node features are added according to the edge information, and a complete NFA graph is decomposed into a representation by several subgraphs, so that graphs with similar substructures are more likely to receive the same classification result. These two key points are described below.
Adding edge information:
The NFA graph corresponding to a regular expression is generated by the Thompson construction, and the information on the edges of the graph determines whether the generated DFA will explode. Even if two NFA graphs are isomorphic, they can still have different labels if the transition information on corresponding edges differs, so edge information must be added into the model; how the edge information is added is important, and it must fully reflect the characteristics of state explosion type regular expressions. After study of the related literature and experimental verification, the model adopts the approach of classifying each node into one of three types by analyzing its out-edge information and adding the type to the WL algorithm as the node's feature information, so that the extracted root subgraphs contain the feature information of the edges that cause state explosion.
Extracting root subgraphs:
Note that if two NFA graphs are composed of more identical subgraphs (including the information on edges), they will, with high probability, have the same label, and experiments verify that the NFAs corresponding to most state explosion type regular expressions have similar substructures. Therefore, during model training the WL algorithm is used to extract the root subgraphs of all graphs, and each graph is regarded as composed of a set of root subgraphs; the resulting graph embedding represents more subgraph information, so NFAs with similar subgraphs are also closer in the embedding space.
Specifically, for a regular expression (from a rule library such as Snort or L7), the process of judging whether its DFA will produce a state explosion is as follows: first, the NFA graph corresponding to the regular expression is generated by the Thompson construction and the graph information is stored in a gexf file; all root subgraphs of the gexf graph are then extracted with the WL algorithm fused with node classification; the graph2vec model is trained to obtain the embedded representation of the graph; finally, the embedded representation of the graph is processed with an SVM classifier to obtain the final label: 1 (explosive) or -1 (non-explosive).
The hardware configuration of the experiment of the invention is shown in Table 2:
table 2 hardware configuration of experiments
Operating system: CentOS Linux release 7.5.1804
Memory: 125 GB
CPU: Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20 GHz
Experiment design: 3520 regular expressions are selected from the rule libraries clamav_regex_strings_2015, l7_rules_2015 and snort_regex_rules_2015, converted into their corresponding NFA graphs by the Thompson construction, and the embedded representation of each graph is obtained by training the model. For the SVM binary classification, since the SVM model is supervised learning, each regular expression is labeled with the threshold method; of the 3520 regular expressions, 1570 turn out to be of the explosion type (label 1) and 1720 of the non-explosion type (label -1). During training, 70% (3168) of the NFA embedded representations are randomly selected as the training set and the remaining 30% (352) as the test set. d = 3, 5, 8 is set respectively, and the experimental results are shown in Table 3:
table 3 comparison of experimental results
The experimental results show that the highest classification accuracy, 98%, is achieved with d = 5, i.e. when 5-hop neighborhood information is aggregated.
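For reference, the evaluation protocol described above (random split of the NFA embeddings, SVM training, accuracy on the held-out set) can be reproduced along these lines with scikit-learn; the split ratio follows the stated 70%/30%, and anything not stated in the text is an assumption.

```python
# Sketch of the experiment: split the embeddings, train an SVM, report accuracy.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def evaluate(embeddings, labels, test_size=0.3, seed=0):
    """embeddings: (n_samples, dim) array; labels: +1 explosive / -1 not."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=test_size, random_state=seed)
    clf = SVC().fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```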
Drawings
FIG. 1 compares the ideas of the Graph2vec model and the Doc2vec model;
(a) Doc2vec model idea, (b) Graph2vec model idea;
FIG. 2 is a flow chart of the method of the present invention;
fig. 3 shows an SVM classification example.
Detailed Description
Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
The original input of the invention is a set of regular expressions, such as the regular expression shown in Table 4: "TRA[^A]{3}HL". As shown in FIG. 2, the process of predicting whether a regular expression is of the state explosion type is divided into the following stages:
(1) Generating the NFA from the regular expression: first, the NFA graph structure corresponding to each regular expression is generated with the Thompson construction; Table 4 shows the NFA graph corresponding to the regular expression "TRA[^A]{3}HL".
Table 4 An example of the WL algorithm execution process
(2) Converting the NFA into standard graph-format data: each regular expression first generates its corresponding NFA graph by the Thompson construction, and the NFA graph information is then recorded in a gexf format file. For example, the gexf file storing the NFA graph structure of Table 4 for the regular expression "TRA[^A]{3}HL" comprises the node information and edge information of the NFA graph, where the node information includes the id value of each node, and the edge information includes the start node (source) and end node (target) of each edge and the transition character of the edge (i.e. the out-edge character, with weight represented by its ASCII code value).
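A minimal sketch of producing such a gexf file with networkx is shown below; the helper name and the exact attribute layout are assumptions, though storing the transition character as its ASCII value in the edge weight follows the description above (character classes such as ^A would need their own encoding).

```python
# Sketch: write an NFA graph (nodes, edges, transition characters) to gexf.
import networkx as nx

def nfa_to_gexf(edges, path):
    """edges: iterable of (source_id, target_id, transition_char)."""
    g = nx.DiGraph()
    for source, target, ch in edges:
        g.add_node(source, label=str(source))
        g.add_node(target, label=str(target))
        g.add_edge(source, target, weight=ord(ch))   # ASCII code of the character
    nx.write_gexf(g, path)

# e.g. nfa_to_gexf([(0, 1, 'T'), (1, 2, 'R'), (2, 3, 'A')], "nfa.gexf")
```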
(3) Extracting all root subgraphs of the NFA: to explain the specific process of extracting root subgraphs after edge information is added, set d = 3, i.e. aggregate the nodes within the 3-hop neighborhood. Taking the NFA graph in Table 4 as an example, the WL algorithm extracts the 3-degree root subgraph of each node in the graph, where the number a in "a+b" denotes the a-th iteration round (i.e. information within the a-hop neighborhood has been aggregated) and the number b denotes the root subgraph label obtained by aggregation. Round 0 records the feature information of the nodes, i.e. the node classification result obtained from each node's out-edge information, where "1" denotes a node that accelerates state explosion, "2" a node that blocks state explosion, and "0" a common node. For example, node 3 is labeled "1" because its out-edge character set is ^A (all characters except A), which then appears 3 times in succession and shares the character T with the prefix, so the node accelerates state explosion; node 2 is labeled "2" because the intersection of its out-edge character A with the following character set ^A is empty, which hinders the generation of a state explosion. In later rounds, the first number in brackets denotes the node's root subgraph label from the previous round; the previous round's labels of all its out-neighbors are then appended, and the combined sequence is mapped to a new label to obtain the label of the root subgraph. Finally, the 3-degree root subgraphs of all nodes are obtained; the NFA in Table 4 is thus composed of nine 3-degree root subgraphs.
(4) Obtaining the embedded representation of the NFA graph: after the WL algorithm is run on each NFA graph, the set of all root subgraphs corresponding to that NFA graph is obtained, so each NFA graph is composed of its corresponding set of root subgraphs. All root subgraphs and the correspondence between root subgraphs and NFA graphs (which NFA graph each root subgraph belongs to) are input into the skip-gram model, and the embedded representation of each NFA graph is obtained by training, i.e. the high-dimensional graph structure data is represented by a low-dimensional vector of 1 × 1024 dimensions (the dimension is set manually).
(5) Classification prediction: after the NFA graphs are embedded into the low-dimensional vector space, a classification model can be used to classify the vector representations (i.e. the embedded representations of the NFA graphs obtained in the previous step). For example, with the SVM (support vector machine) classification method adopted by the invention, the embedded representation of an NFA graph is fed directly into the classification model, which yields the class of the regular expression corresponding to that NFA graph (i.e. whether it is a state explosion type regular expression). The basic idea of the SVM is to solve for a separating hyperplane that correctly divides the training data set with the largest geometric margin; the points in the two half-spaces separated by the hyperplane correspond to the two classification results, as shown in FIG. 3.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make modifications and improvements without departing from the principles of the invention, and various substitutions, alterations and modifications are possible without departing from its spirit and scope. The invention is not limited to the embodiments of this description and the disclosure of the drawings; its scope is defined by the claims.

Claims (4)

1. A method for identifying a state explosion type regular expression, comprising the following steps:
1) for a regular expression to be identified, generating the NFA graph corresponding to the regular expression according to the Thompson construction, and storing the information of the NFA graph in a gexf file, to obtain an NFA graph set corresponding to the regular expression; the information of the NFA graph includes node information and edge information of the NFA graph, wherein the node information includes the id value of each node, and the edge information includes the start node, the end node, and the transition character on the edge for each edge; according to its out-edge character set, each node is classified into one of three types as the node's feature information: nodes that accelerate state explosion, nodes that block state explosion, and common nodes; wherein a node that accelerates state explosion is defined as follows: if the out-edge character set of a node n overlaps the prefix of the regular expression and the out-edge character set appears k times in succession, node n is marked as a node that accelerates state explosion; a node that blocks state explosion is defined as follows: if there is a node n that accelerates state explosion and there is a node m before node n whose out-edge character set has an empty intersection with the out-edge character set of node n, node m is marked as a node that blocks state explosion; the remaining nodes, other than the nodes that accelerate state explosion and the nodes that block state explosion, are marked as common nodes;
2) for each NFA graph in the NFA graph set, adding the node feature information of the NFA graph into the WL algorithm, extracting all root subgraphs in the gexf file corresponding to the NFA graph with the WL algorithm fused with node classification, inputting the root subgraphs into a graph2vec model, and training to obtain an embedded representation of the NFA graph;
3) processing the embedded representation of the NFA graph with a classification model to judge whether the regular expression is a state explosion type regular expression.
2. The method of claim 1, wherein the classification model is an SVM classifier.
3. The method of claim 1, wherein the graph2vec model is trained as follows: all root subgraphs in the NFA graphs are treated as the vocabulary of a document; the vectorized representation of the document is obtained by maximizing $\sum_{j}\log \Pr(w_j \mid d_i)$, and the optimization objective is to maximize the probability of each graph's own root subgraphs appearing in it, i.e. to maximize $\sum_{n}\log \Pr\big(sg_n^{(d)} \mid \Phi(G)\big)$, where $d_i$ denotes the vectorized representation of the i-th document, $w_j$ denotes the embedded representation of the j-th word of the vocabulary appearing in the i-th document, $\Pr(w_j \mid d_i)$ denotes the conditional probability of the word $w_j$ occurring in document $d_i$, $\Phi(G)$ is the vectorized representation of the graph G whose embedded representation is sought, and $sg_n^{(d)}$ denotes the vectorized representation of the d-degree root subgraph rooted at node n in graph G.
4. A recognition system for state explosion type regular expressions, comprising an NFA graph conversion module, a root subgraph extraction module, an NFA graph embedded-representation generation module, and a classification model, wherein:
the NFA graph conversion module is used to generate the NFA graph corresponding to the regular expression according to the Thompson construction and store the information of the NFA graph in a gexf file, obtaining an NFA graph set corresponding to the regular expression; the information of the NFA graph includes node information and edge information of the NFA graph, wherein the node information includes the id value of each node, and the edge information includes the start node, the end node, and the transition character on the edge for each edge; according to its out-edge character set, each node is classified into one of three types as the node's feature information: nodes that accelerate state explosion, nodes that block state explosion, and common nodes; wherein a node that accelerates state explosion is defined as follows: if the out-edge character set of a node n overlaps the prefix of the regular expression and the out-edge character set appears k times in succession, node n is marked as a node that accelerates state explosion; a node that blocks state explosion is defined as follows: if there is a node n that accelerates state explosion and there is a node m before node n whose out-edge character set has an empty intersection with the out-edge character set of node n, node m is marked as a node that blocks state explosion; the remaining nodes, other than the nodes that accelerate state explosion and the nodes that block state explosion, are marked as common nodes;
the root subgraph extraction module is used, for each NFA graph in the NFA graph set, to add the node feature information of the NFA graph into the WL algorithm and to extract all root subgraphs in the gexf file corresponding to the NFA graph with the WL algorithm fused with node classification;
the NFA graph embedded-representation generation module is used to train on all root subgraphs of the NFA graph with a graph2vec model, obtaining the embedded representation of the NFA graph;
the classification model is used to judge, from the embedded representation of the NFA graph of the regular expression, whether the regular expression is a state explosion type regular expression.
CN202110784458.0A 2021-07-12 2021-07-12 Method and system for identifying state explosion type regular expression Active CN113627164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110784458.0A CN113627164B (en) 2021-07-12 2021-07-12 Method and system for identifying state explosion type regular expression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110784458.0A CN113627164B (en) 2021-07-12 2021-07-12 Method and system for identifying state explosion type regular expression

Publications (2)

Publication Number Publication Date
CN113627164A CN113627164A (en) 2021-11-09
CN113627164B true CN113627164B (en) 2024-03-01

Family

ID=78379500

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110784458.0A Active CN113627164B (en) 2021-07-12 2021-07-12 Method and system for identifying state explosion type regular expression

Country Status (1)

Country Link
CN (1) CN113627164B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101827084A (en) * 2009-01-28 2010-09-08 丛林网络公司 Efficient application identification of network equipment
CN103259793A (en) * 2013-05-02 2013-08-21 东北大学 Method for inspecting deep packets based on suffix automaton regular engine structure
CN109800337A (en) * 2018-12-06 2019-05-24 成都网安科技发展有限公司 A kind of multi-mode canonical matching algorithm suitable for big alphabet

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10242125B2 (en) * 2013-12-05 2019-03-26 Entit Software Llc Regular expression matching
US10042654B2 (en) * 2014-06-10 2018-08-07 International Business Machines Corporation Computer-based distribution of large sets of regular expressions to a fixed number of state machine engines for products and services

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101827084A (en) * 2009-01-28 2010-09-08 丛林网络公司 Efficient application identification of network equipment
CN103259793A (en) * 2013-05-02 2013-08-21 东北大学 Method for inspecting deep packets based on suffix automaton regular engine structure
CN109800337A (en) * 2018-12-06 2019-05-24 成都网安科技发展有限公司 A kind of multi-mode canonical matching algorithm suitable for big alphabet

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A regular expression matching method addressing DFA state explosion; Wang Xiang; Lu Yuhai; Ma Wei; Liu Yanbing; Computer Engineering; Vol. 45, No. 04; 148-156 *

Also Published As

Publication number Publication date
CN113627164A (en) 2021-11-09

Similar Documents

Publication Publication Date Title
CN110059181B (en) Short text label method, system and device for large-scale classification system
US9558299B2 (en) Submatch extraction
CN113434858B (en) Malicious software family classification method based on disassembly code structure and semantic features
CN109033833B (en) Malicious code classification method based on multiple features and feature selection
US20220179955A1 (en) Mobile malicious code classification method based on feature selection and recording medium and device for performing the same
US20230334154A1 (en) Byte n-gram embedding model
CN115357904B (en) Multi-class vulnerability detection method based on program slicing and graph neural network
Khan et al. Malware classification framework using convolutional neural network
CN115361176A (en) SQL injection attack detection method based on FlexUDA model
Xu et al. DHA: Supervised deep learning to hash with an adaptive loss function
CN113627164B (en) Method and system for identifying state explosion type regular expression
CN111209373A (en) Sensitive text recognition method and device based on natural semantics
Machado et al. Improving face detection
CN112380535B (en) CBOW-based malicious code three-channel visual identification method
CN113869398A (en) Unbalanced text classification method, device, equipment and storage medium
CN114139153A (en) Graph representation learning-based malware interpretability classification method
CN113626574A (en) Information query method, system, device and medium
CN113657443A (en) Online Internet of things equipment identification method based on SOINN network
CN112733144A (en) Malicious program intelligent detection method based on deep learning technology
CN112906588A (en) Riot and terrorist picture safety detection system based on deep learning
CN111581640A (en) Malicious software detection method, device and equipment and storage medium
Ssebulime Email classification using machine learning techniques
Ji et al. Multi-Scale Defense of Adversarial Images
Minařík et al. Recognition of randomly deformed objects
US20230229740A1 (en) Multiclass classification apparatus and method robust to imbalanced data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant