CN115858405A - Grammar-aware fuzz testing method and system for code testing

Info

Publication number: CN115858405A
Application number: CN202310194536.0A
Authority: CN
Inventors: 毛得明; 茹凯琪; 吴春明; 曹夕; 卞绪杰
Original assignee: CETC 30 Research Institute
Current assignee: CETC 30 Research Institute
Priority date: 2023-03-03
Filing date: 2023-03-03
Publication date: 2023-03-28

Abstract

The invention relates to the technical field of grammar-aware fuzz testing and discloses a grammar-aware fuzz testing method and system for code testing. The method applies deep learning to learn the grammar of vulnerable source code, generates new code segments, and uses the new code segments to fuzz test code, where vulnerable source code means source code that has been publicly disclosed to contain a security defect. The invention addresses the prior-art problems that code segments are difficult to generate automatically and that conformance to the programming language specification is difficult to check accurately.

Description

Grammar-aware fuzz testing method and system for code testing

Technical Field

The invention relates to the technical field of grammar-aware fuzz testing, and in particular to a grammar-aware fuzz testing method and system for code testing.

Background

Grammar-aware fuzz testing is primarily a method for automated vulnerability discovery in critical infrastructure software such as interpreters and compilers. It automatically generates test cases by recognizing syntax and semantics and attempts to drive the target program into an abnormal state, thereby exposing security defects.

Interpreters and compilers are key infrastructure in the software field, and a defect in one of them creates a serious security risk. The behavior of the generated executable may be inconsistent with the semantics of the source program, causing unexpected errors that application developers cannot easily detect and that can lead to serious production incidents. The reliability of the interpreter (or compiler) is therefore critical. However, fuzz testing an interpreter or compiler is difficult because it requires highly structured input. A test case for an interpreter or compiler is a piece of code that must conform to the programming language specification. If a generated test case does not conform, the interpreter (or compiler) reports a syntax error early, and deeper testing cannot proceed.

Conventional fuzz testing methods can be divided into mutation-based and generation-based fuzz testing. Mutation-based fuzz testing creates test cases by applying mutations to existing data samples; generation-based fuzz testing generates test cases from scratch by modeling the target protocol or file format. Both approaches have difficulty producing highly structured data.

To address test case generation, one existing approach extracts code segments (variables, expressions, operators and the like), recombines them, and adds a language validity check, producing syntactically correct test code whose length, nesting depth and number of passed parameters are controllable. However, the combinations produced this way are relatively simple and do not analyze or exploit the data flow of the code. A second approach builds a test case library; when the library is large enough, certain security risks can be eliminated. The difficulty is that the test cases available for a programming language are always finite and cannot cover an infinite state space. Because programming language specifications are extensive, general-purpose test case generation struggles to cover a large state space, and some current research therefore fuzz tests specific language features.

The technical problems of the prior art mainly include the following two points:

1. Automatic generation of code segments:

Code segments produced by combining existing segments differ little from the original code, and building the various code libraries consumes considerable manpower and material resources. A method that generates code segments automatically is therefore needed so that interpreters and compilers can be fuzz tested efficiently.

2. Programming language specification checking:

After code segments are generated automatically, they must be checked against the programming language specification and adjusted where they do not conform. Combining code segments predicted by a deep learning model with original code segments easily introduces variable reference errors.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides a grammar-aware fuzz testing method and system for code testing, which address the prior-art problems that code segments are difficult to generate automatically and that programming language specification checking is difficult to perform accurately.

The technical solution adopted by the invention is as follows:

A grammar-aware fuzz testing method for code testing applies deep learning to learn the grammar of vulnerable source code, generates new code segments, and uses the new code segments to fuzz test code, where vulnerable source code means source code that has been publicly disclosed to contain a security defect.

As a preferred technical solution, the method comprises the following steps:

S1, data preprocessing: converting the collected vulnerable source code into a data type that the deep learning model can recognize, thereby parsing the vulnerable source code;

S2, model training: training the deep learning model with the preprocessed vulnerable source code as the training set;

S3, code generation and checking: generating new code segments with the trained deep learning model and then checking the generated code segments for conformance to the language specification;

S4, fuzz testing: feeding the generated new code segments to an interpreter or compiler for fuzz testing.

As a preferred technical solution, step S1 comprises the following steps:

S11, running the test code with an interpreter to check whether the vulnerable source code conforms to the syntax specification;

S12, converting the vulnerable source code into an abstract syntax tree;

S13, replacing the variable names and function names in the vulnerable source code with normalized, custom-defined variable names and function names;

S14, converting the complete abstract syntax tree into abstract syntax tree subtrees;

S15, converting the abstract syntax tree subtrees into input that the deep learning model can recognize.

As a preferred technical solution, in S14 the abstract syntax tree subtrees are formed by recursively traversing the abstract syntax tree and replacing each subtree of the current node with the root node of that subtree.

As a preferred technical solution, in S15 the dependency relationship between the abstract syntax tree subtrees is determined by the order of the abstract syntax tree fragments.

As a preferred technical solution, step S2 comprises the following steps:

S21, building a statistical language model from each fragment sequence, so that the statistical language model can predict the next fragment from the context fragments; the training objective of the statistical language model is: given the abstract syntax tree sequence X of a piece of code, predict the next abstract syntax tree fragment Y from X;

S22, defining a loss function that rewards locating fragments of the relevant type;

the loss function is:

[two formulas, not reproduced in this text];

wherein the symbols of the formulas denote, in order of definition: an abstract syntax tree fragment of the abstract syntax tree sequence; the next abstract syntax tree fragment predicted with the greatest likelihood; a normalization function; the distribution probability of the next abstract syntax tree fragment; an objective measure function; the set consisting of the current abstract syntax tree fragment and the next abstract syntax tree fragment; the number of elements in that set; the probability distribution of the next abstract syntax tree fragment; a cross-entropy loss function; and a term expressing that abstract syntax tree fragments having the same type as the real abstract syntax tree fragment are prioritized;

S23, reducing the two loss terms during training so that the statistical language model achieves the training objective.

As a preferred technical solution, in S22 the two loss terms are each defined by a formula:

[two formulas, not reproduced in this text];

wherein the symbols of the formulas denote, in order of definition: the index i of an abstract syntax tree fragment; the total number of abstract syntax tree fragments; the probability distribution of the abstract syntax tree fragment that follows the i-th abstract syntax tree fragment; the predicted probability distribution of the next abstract syntax tree fragment; the index of an abstract syntax tree fragment of the correct type; the probability distribution over fragments of the correct type; the number of abstract syntax tree fragments having the same type as the real abstract syntax tree fragment; a function returning the first n most likely abstract syntax tree fragments in the set predicted for the next abstract syntax tree fragment; and a function returning the set of abstract syntax tree fragments of the correct type.

As a preferred technical solution, step S3 comprises the following steps:

S31, randomly selecting a test case from the test case set and randomly deleting one abstract syntax tree fragment from it, obtaining the abstract syntax tree fragment that remains after deletion (the post-deletion fragment); the test case set is a collection of multiple pieces of vulnerable source code;

S32, synthesizing an abstract syntax tree: building a complete abstract syntax tree from the post-deletion fragment obtained in step S31;

S33, performing a syntax check on the complete abstract syntax tree to determine whether it conforms to the syntax specification.

As a preferred technical solution, step S32 comprises the following steps:

S321, inputting the post-deletion fragment into the trained deep learning model to obtain candidate abstract syntax tree fragments;

S322, selecting a suitable abstract syntax tree fragment from the candidates and combining it with the post-deletion fragment to generate a complete abstract syntax tree; wherein a suitable abstract syntax tree fragment is one that satisfies both of the following conditions: A. it has the highest probability in the set predicted for the next abstract syntax tree fragment; B. it matches the type required by the current abstract syntax tree fragment;

S323, traversing the abstract syntax tree to find the position at which to add an abstract syntax tree fragment, until no more nodes can be added, thereby completing the synthesis of the abstract syntax tree; the position is found as follows: if the current node has no child nodes and is not a terminal node, the abstract syntax tree fragment is attached to the current node; otherwise, the search is invoked recursively on the child nodes of the current node, traversing in pre-order.
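The position search of S323 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Node` class, `find_open_node`, `synthesize`, and the table-driven `stub_predict` (which stands in for the trained model) are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    label: str
    terminal: bool = False                      # terminal nodes never get children
    children: List["Node"] = field(default_factory=list)


def find_open_node(node: Node) -> Optional[Node]:
    """Pre-order search for a non-terminal node that has no children yet."""
    if not node.children:
        return None if node.terminal else node
    for child in node.children:
        found = find_open_node(child)
        if found is not None:
            return found
    return None


def synthesize(root: Node, predict: Callable[[Node], List[Node]]) -> Node:
    """Attach predicted children until no node can be added."""
    while (target := find_open_node(root)) is not None:
        children = predict(target)
        if not children:
            raise ValueError(f"no fragment available for {target.label}")
        target.children = children
    return root


def stub_predict(node: Node) -> List[Node]:
    # Hypothetical stand-in for the trained model: a fixed lookup table.
    table = {"BinOp": [Node("Name:v1", True), Node("Add", True), Node("Num:1", True)]}
    return table.get(node.label, [])


# Post-deletion tree for `v0 = <missing expression>`, written by hand.
root = Node("Assign", children=[Node("Name:v0", True), Node("BinOp")])
synthesize(root, stub_predict)
```

After the call, the `BinOp` node carries the three predicted children and `find_open_node(root)` returns `None`, which corresponds to the "no more nodes can be added" stopping condition.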

A grammar-aware fuzz testing system for code testing, used to implement the grammar-aware fuzz testing method for code testing described above, comprises the following modules connected in sequence:

a data preprocessing module, configured to convert the collected vulnerable source code into a data type that the deep learning model can recognize, thereby parsing the vulnerable source code;

a model training module, configured to train the deep learning model with the preprocessed vulnerable source code as the training set;

a code generation and checking module, configured to generate new code segments with the trained deep learning model and then check the generated code segments for conformance to the language specification;

a fuzz testing module, configured to feed the generated new code segments to an interpreter or compiler for fuzz testing.

Compared with the prior art, the invention has the following beneficial effects:

The invention provides a method that uses deep learning to generate new code segments automatically. The generated segments retain the syntactic characteristics of the original code segments while adding new structures, which increases the diversity of the code segments. In addition, inferring variable types resolves syntax errors in the newly generated code segments and improves the validity of the generated code to a certain extent. Through this automatic generation and error correction mechanism, the invention greatly improves the efficiency of code segment generation, increases the diversity of test cases, and optimizes grammar-aware fuzz testing.

Drawings

FIG. 1 is a schematic diagram of the steps of the grammar-aware fuzz testing method for code testing according to the invention;

FIG. 2 is a schematic diagram of the sub-steps of step S1 of the invention;

FIG. 3 is a schematic diagram of the sub-steps of step S3 of the invention.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited to these examples.

Example 1

As shown in FIG. 1 to FIG. 3, the invention uses deep learning to train a code segment prediction model on the syntactic structure of existing code segments, so that the generation of new code segments is guided automatically.

The invention supplies correct variable references by static type inference, thereby improving the correctness of the generated code.

The technical solution starts from an observation about code segments that trigger security vulnerabilities. Collecting proof-of-concept (PoC) samples of interpreter (or compiler) exploits and summarizing their common characteristics showed that more than 90% of the code fragments in them overlap syntactically. If the grammar of code with known security defects is learned and code segments are generated directly from it, the new code segments retain the syntactic characteristics of the PoC code and can therefore trigger security defects in the interpreter (or compiler) more efficiently.

The method applies deep learning to learn the grammar of the collected vulnerable source code, generates new code segments, and thereby achieves automated fuzz testing. The basic process is shown in FIG. 1.

The first stage parses the source code, converting the collected vulnerable source code into a data type that the deep learning model can recognize. The basic process is shown in FIG. 2.

Because the invention focuses on grammar learning, the test code is first run with an interpreter to check whether it conforms to the syntax specification. The code is then converted into an abstract syntax tree (AST), which represents the code as a tree structure and omits some code details. Different code segments therefore exhibit similar structures, which makes them easier for a deep learning model to recognize and generate.
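A minimal sketch of the check-then-convert step. It assumes Python's standard `ast` module as a stand-in for the target language's parser, since the source names neither a language nor a parser, and it parses the code rather than executing it. `to_ast` is a hypothetical helper.

```python
import ast
from typing import Optional


def to_ast(source: str) -> Optional[ast.Module]:
    """Return the AST of `source`, or None if it violates the grammar."""
    try:
        return ast.parse(source)
    except SyntaxError:
        return None


valid = to_ast("x = 1 + 2\nprint(x)")   # conforms to the grammar
invalid = to_ast("x = = 1")             # rejected before any deeper processing
```

Samples for which `to_ast` returns `None` would be dropped from the training set, because they cannot contribute well-formed fragments.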

Code normalization replaces the variable names, function names and similar identifiers in the code so that training focuses on statement structure and the influence of naming is reduced. By traversing the AST, the user-defined variable names and function names are collected and replaced with normalized names within the scope of each user-defined variable or function. Specifically, each variable name is replaced with v plus a number, the number increasing with each distinct variable encountered during traversal, and each function name is replaced with f plus a number, the number increasing with each distinct function name encountered. Variable and function names that come from code libraries are not replaced, which makes it possible to generate AST fragments that contain library code.
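The renaming scheme can be sketched as below, again assuming Python's `ast` module (Python 3.9+ for `ast.unparse`) as a stand-in for the target language. The helper `normalize` is hypothetical, and it simplifies scoping by using one global mapping rather than per-scope replacement.

```python
import ast


def normalize(source: str) -> str:
    """Rename user-defined variables to v0, v1, ... and functions to f0, f1, ..."""
    tree = ast.parse(source)
    mapping = {}
    counters = {"v": 0, "f": 0}

    def assign(name: str, prefix: str) -> None:
        if name not in mapping:
            mapping[name] = f"{prefix}{counters[prefix]}"
            counters[prefix] += 1

    # Pass 1: collect identifiers that the code itself defines.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            assign(node.name, "f")
        elif isinstance(node, ast.arg):
            assign(node.arg, "v")
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            assign(node.id, "v")

    # Pass 2: rewrite every occurrence. Names never defined in the code
    # (library names such as print) are not in the mapping and stay as-is.
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            node.name = mapping[node.name]
        elif isinstance(node, ast.arg):
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]
    return ast.unparse(tree)


sample = """
def area(w, h):
    total = w * h
    print(total)
    return total

result = area(2, 3)
"""
normalized = normalize(sample)
```

In `normalized`, `area` becomes `f0` and every user variable becomes a `v`-numbered name, while the library call `print` is left untouched. The exact numbers depend on traversal order, which here is the breadth-first order of `ast.walk`.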

AST fragmentation converts a complete AST into AST subtrees of height 1, which facilitates training the deep learning model. The AST is traversed recursively, and each subtree of the current node is replaced with that subtree's root node, forming an AST subtree of height 1. The root node of each fragment is an internal node of the AST and, except for the root of the AST itself, also appears as a leaf node in another fragment. For a given AST, one unit subtree is extracted from each internal node, so the number of extracted unit subtrees equals the number of internal nodes of the AST.
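A short sketch of fragmentation on a hand-written tree. The tuple-based node encoding and the names `fragment` and `count_internal` are assumptions made for illustration only.

```python
from typing import List, Tuple

# A node is (label, children); a terminal node has an empty child list.
Node = Tuple[str, list]
# A fragment is (root label, labels of the root's direct children).
Fragment = Tuple[str, Tuple[str, ...]]


def fragment(node: Node) -> List[Fragment]:
    """Pre-order list of height-1 subtrees, one per internal node."""
    label, children = node
    if not children:                 # terminal node: contributes no fragment
        return []
    frags = [(label, tuple(child[0] for child in children))]
    for child in children:
        frags.extend(fragment(child))
    return frags


def count_internal(node: Node) -> int:
    """Number of nodes that have at least one child."""
    return (1 if node[1] else 0) + sum(count_internal(c) for c in node[1])


# AST of `v0 = v1 + 1`, written by hand.
tree = ("Assign", [("Name:v0", []),
                   ("BinOp", [("Name:v1", []), ("Add", []), ("Num:1", [])])])
frags = fragment(tree)
```

Here `frags` holds two fragments, one rooted at `Assign` and one rooted at `BinOp`, matching the two internal nodes of the tree. `BinOp` is the root of the second fragment and a leaf of the first.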

AST sequence vectorization converts the AST subtrees into input that the deep learning model can recognize. The composition relationship between fragments is modeled as an ordering of the fragments, so that the deep learning model can predict the next fragment from the fragments that precede it in the sequence.
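One plausible way to vectorize a fragment sequence is to map each distinct fragment to an integer id and to form (context, next fragment) pairs. The source does not specify the encoding, so `build_vocab`, `encode` and `context_target_pairs` are hypothetical helpers.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def build_vocab(sequences: Sequence[Sequence[Hashable]]) -> Dict[Hashable, int]:
    """Assign each distinct fragment an integer id in order of first appearance."""
    vocab: Dict[Hashable, int] = {}
    for seq in sequences:
        for frag in seq:
            vocab.setdefault(frag, len(vocab))
    return vocab


def encode(seq: Sequence[Hashable], vocab: Dict[Hashable, int]) -> List[int]:
    return [vocab[frag] for frag in seq]


def context_target_pairs(ids: List[int]) -> List[Tuple[List[int], int]]:
    """(preceding fragments, next fragment) for every position after the first."""
    return [(ids[:i], ids[i]) for i in range(1, len(ids))]


# Two fragment sequences, one per source file; strings stand in for fragments.
files = [["Program", "Assign", "BinOp"], ["Program", "Call", "BinOp"]]
vocab = build_vocab(files)
ids = encode(files[1], vocab)
pairs = context_target_pairs(ids)
```

Each pair gives the model a prefix of the sequence as context and the fragment that follows it as the prediction target.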

The second stage performs the training. The fragment set contains some fragments with no significant meaning, which degrade the training result. Therefore, before training begins, fragments that occur infrequently are labeled.
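The source does not say how infrequent fragments are labeled. A common choice, assumed here, is to replace every fragment seen fewer than a threshold number of times with a single placeholder label. `RARE`, `label_rare` and `min_count` are hypothetical.

```python
from collections import Counter
from typing import List

RARE = "<rare>"  # hypothetical placeholder label for infrequent fragments


def label_rare(sequences: List[List[str]], min_count: int = 2) -> List[List[str]]:
    """Replace fragments that occur fewer than `min_count` times with RARE."""
    counts = Counter(frag for seq in sequences for frag in seq)
    return [[frag if counts[frag] >= min_count else RARE for frag in seq]
            for seq in sequences]


corpus = [["Program", "Assign", "BinOp"], ["Program", "Call", "BinOp"]]
labeled = label_rare(corpus)
```

With the default threshold, `Assign` and `Call` each occur once and are replaced, while `Program` and `BinOp` are kept.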

Each fragment sequence represents one file and serves as one training input. From each input, a statistical language model is built that can predict the next fragment from the context fragments. The training objective is: given the AST sequence X of a piece of code, predict the next AST fragment Y from X. [The formulas defining X and Y are not reproduced in this text; their elements denote the i-th AST fragment of the r ASTs in a piece of code.] Among the predicted outputs, the most probable code fragment is selected, with priority given to fragments of the same type as the true fragment over fragments of other types.

To achieve the training objective, a new loss function is defined that rewards locating fragments of the relevant type, so that the objective is optimized continually. The loss function is applied to the training set D.

[The formulas for the loss function and for its two component terms are not reproduced in this text.]

In these formulas, one quantity denotes the number of fragments of the same type as the real fragment; one function returns the first n fragments and another returns the fragments of the correct type; and the reward term gives priority to fragments having the same type as the real fragment.

Reducing the two loss terms during training enables the model to achieve the training objective. In the end, the model can not only predict the correct fragment for a given context but also locate, among its recommendations, fragments of the same type as the correct fragment.
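The two ingredients named in the text, a cross-entropy term and a count of same-type fragments among the first n predictions, can be illustrated as follows. The patent's exact formulas, including how the two quantities are combined, are not recoverable from this text, so the sketch computes them separately and makes no claim about their combination. All function names are hypothetical.

```python
import math
from typing import List


def cross_entropy(probs: List[float], true_idx: int) -> float:
    """Negative log-probability assigned to the true next fragment."""
    return -math.log(probs[true_idx])


def top_n(probs: List[float], n: int) -> List[int]:
    """Indices of the n most probable fragments, most probable first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:n]


def same_type_count(probs: List[float], frag_types: List[str],
                    true_idx: int, n: int) -> int:
    """How many of the top-n predictions share the true fragment's type."""
    wanted = frag_types[true_idx]
    return sum(1 for i in top_n(probs, n) if frag_types[i] == wanted)


# Predicted distribution over four candidate fragments and their types.
probs = [0.5, 0.25, 0.15, 0.1]
frag_types = ["Expr", "Stmt", "Expr", "Expr"]
ce = cross_entropy(probs, true_idx=2)
hits = same_type_count(probs, frag_types, true_idx=2, n=2)
```

In this example the true fragment (index 2) is not the top prediction, but one of the top two predictions has the same type, which is the kind of outcome the type-related term is described as rewarding.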

The third stage mainly completes the generation and checking of the code. The basic process is shown in fig. 3.

First, a test case is randomly selected from the test case set, and one AST fragment is randomly deleted from it. The post-deletion AST is taken as input, and the trained deep learning model predicts candidate AST fragments. A Top-K algorithm selects a suitable fragment from the candidates, which is combined with the post-deletion AST to generate a complete AST. This combination is the reverse of the AST fragmentation of the first stage. The AST is traversed to find the position at which to add a fragment: if the current node has no children and is not a terminal node, the fragment is attached to the current node; otherwise, the search is invoked recursively on the children of the node, traversing in pre-order. AST synthesis is complete when no more nodes can be added.
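A minimal sketch of the Top-K selection, under the assumption that each candidate carries a type and a probability and that the selected fragment is the most probable candidate within the top k whose type matches the one required. `select_fragment` and the candidate tuple layout are hypothetical.

```python
from typing import List, Optional, Tuple

# A candidate is (fragment, fragment type, predicted probability).
Candidate = Tuple[str, str, float]


def select_fragment(candidates: List[Candidate], required_type: str,
                    k: int = 5) -> Optional[str]:
    """Most probable fragment among the top k whose type matches."""
    top_k = sorted(candidates, key=lambda c: c[2], reverse=True)[:k]
    for frag, frag_type, _prob in top_k:
        if frag_type == required_type:
            return frag
    return None                      # no suitable fragment among the top k


candidates = [("f0(v0)", "Call", 0.4), ("v0 + v1", "BinOp", 0.3),
              ("v0 - v1", "BinOp", 0.2), ("v2", "Name", 0.1)]
chosen = select_fragment(candidates, required_type="BinOp", k=3)
```

Restricting the search to the top k keeps low-probability fragments out even when their type happens to match, at the cost of sometimes returning no fragment at all.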

Next, a syntax check is performed on the complete AST to determine whether it conforms to the syntax specification. Because different AST fragments have been combined, reference errors arise easily. The invention uses the context in which the AST fragment was generated and statically infers the type of each newly introduced variable from the way it is used: if the variable appears in a binary or ternary operator, its type is taken to match the types of the other variables in the expression; if it appears in a unary operator or with no operator, it is taken to be an integer. Variables of the inferred types are then declared at the start of the variable's scope, which resolves the variable reference errors.
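The inference rule can be sketched with Python's `ast` module as a stand-in for the target language. `infer_type`, `declare`, the `known_types` mapping and the default initial values are hypothetical. For the ternary case the sketch compares only the two result branches, which is a design choice of this example rather than something stated in the source.

```python
import ast
from typing import Dict

DEFAULTS = {"int": "0", "float": "0.0", "str": "''"}  # assumed initial values


def infer_type(expr_src: str, new_var: str, known_types: Dict[str, str]) -> str:
    """Type of `new_var` inferred from how it is used in `expr_src`."""
    tree = ast.parse(expr_src, mode="eval")
    for node in ast.walk(tree):
        if isinstance(node, ast.BinOp):          # binary operator
            operands = [node.left, node.right]
        elif isinstance(node, ast.IfExp):        # ternary: the two result branches
            operands = [node.body, node.orelse]
        else:
            continue
        names = [o.id for o in operands if isinstance(o, ast.Name)]
        if new_var in names:
            for other in names:
                if other != new_var and other in known_types:
                    return known_types[other]
    return "int"   # unary operator, no operator, or no typed sibling found


def declare(var: str, type_name: str) -> str:
    """Declaration to place at the start of the variable's scope."""
    return f"{var} = {DEFAULTS[type_name]}"


inferred = infer_type("v0 + v9", "v9", {"v0": "str"})
declaration = declare("v9", inferred)
```

Here `v9` shares a binary operator with the string variable `v0`, so it is declared as an empty string. In the expression `-v9` it would instead default to an integer.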

The fourth stage performs the fuzz testing. The generated code segments are sent to the interpreter (or compiler), completing the fuzz testing step.
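A minimal harness for this stage might look as follows. It assumes the target is a command-line interpreter and, purely for illustration, uses the running Python interpreter (`sys.executable`) as that target. `run_case` and its outcome labels are hypothetical, and the negative-return-code check for a crash applies to POSIX systems.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Sequence


def run_case(code: str, interpreter: Sequence[str] = (sys.executable,),
             timeout: float = 5.0) -> str:
    """Run one generated test case and classify the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "case.py"
        path.write_text(code)
        try:
            proc = subprocess.run([*interpreter, str(path)],
                                  capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return "timeout"
    if proc.returncode < 0:          # POSIX: terminated by a signal, e.g. SIGSEGV
        return "crash"
    return "ok" if proc.returncode == 0 else "error"


outcome = run_case("print(1 + 2)")
```

Cases classified as `crash` or `timeout` are the ones worth keeping for manual analysis. An `error` outcome usually means only that the generated code raised an ordinary exception.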

The invention provides a method that uses deep learning to generate new code segments automatically. The generated segments retain the syntactic characteristics of the original code segments while adding new structures, which increases the diversity of the code segments. In addition, inferring variable types resolves syntax errors in the newly generated code segments and improves the validity of the generated code to a certain extent. Through this automatic generation and error correction mechanism, the method greatly improves the efficiency of code segment generation, increases the diversity of test cases, and optimizes grammar-aware fuzz testing.

As described above, the present invention can be preferably realized.

All features disclosed in all embodiments of the present specification, or all methods or process steps implicitly disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.

The foregoing is only a preferred embodiment of the present invention, and the present invention is not limited thereto in any way, and any simple modification, equivalent replacement and improvement made to the above embodiment within the spirit and principle of the present invention still fall within the protection scope of the present invention.

Claims

1. A grammar-aware fuzz testing method for code testing, characterized in that deep learning is used to learn the grammar of vulnerable source code, new code segments are generated, and the new code segments are used to fuzz test code; wherein vulnerable source code means source code that has been publicly disclosed to contain a security defect.

2. The grammar-aware fuzz testing method for code testing according to claim 1, characterized by comprising the following steps:

S1, data preprocessing: converting the collected vulnerable source code into a data type that the deep learning model can recognize, thereby parsing the vulnerable source code;

S2, model training: training the deep learning model with the preprocessed vulnerable source code as the training set;

S3, code generation and checking: generating new code segments with the trained deep learning model and then checking the generated code segments for conformance to the language specification;

S4, fuzz testing: feeding the generated new code segments to an interpreter or compiler for fuzz testing.

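3. The grammar-aware fuzz testing method for code testing according to claim 2, wherein step S1 comprises the following steps:

S11, running the test code with an interpreter to check whether the vulnerable source code conforms to the syntax specification;

S12, converting the vulnerable source code into an abstract syntax tree;

S13, replacing the variable names and function names in the vulnerable source code with normalized, custom-defined variable names and function names;

S14, converting the complete abstract syntax tree into abstract syntax tree subtrees;

S15, converting the abstract syntax tree subtrees into input that the deep learning model can recognize.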

4. The grammar-aware fuzz testing method for code testing according to claim 3, wherein in S14 the abstract syntax tree subtrees are formed by recursively traversing the abstract syntax tree and replacing each subtree of the current node with the root node of that subtree.

5. The grammar-aware fuzz testing method for code testing according to claim 4, wherein in S15 the dependency relationship between the abstract syntax tree subtrees is determined by the order of the abstract syntax tree fragments.

6. The grammar-aware fuzz testing method for code testing according to any one of claims 2 to 5, wherein step S2 comprises the following steps:

S21, building a statistical language model from each fragment sequence, so that the statistical language model can predict the next fragment from the context fragments; the training objective of the statistical language model is: given the abstract syntax tree sequence X of a piece of code, predict the next abstract syntax tree fragment Y from X;

S22, defining a loss function that rewards locating fragments of the relevant type;

the loss function is:

[two formulas, not reproduced in this text];

wherein the symbols of the formulas denote, in order of definition: an abstract syntax tree fragment of the abstract syntax tree sequence; the next abstract syntax tree fragment predicted with the greatest likelihood; a normalization function; the number of elements in a set of abstract syntax tree fragments; and a cross-entropy loss function;

S23, reducing the two loss terms during training so that the statistical language model achieves the training objective.

7. The grammar-aware fuzz testing method for code testing according to claim 6, wherein in S22 the two loss terms are each defined by a formula:

[two formulas, not reproduced in this text];

wherein the symbols of the formulas denote, in order of definition: the index of an abstract syntax tree fragment; the total number of abstract syntax tree fragments; the index of an abstract syntax tree fragment of the correct type; the number of abstract syntax tree fragments having the same type as the real abstract syntax tree fragment; and a function returning the first n most likely abstract syntax tree fragments in the set predicted for the next abstract syntax tree fragment.

8. The grammar-aware fuzz testing method for code testing according to claim 7, wherein step S3 comprises the following steps:

S31, randomly selecting a test case from the test case set and randomly deleting one abstract syntax tree fragment from it, obtaining the abstract syntax tree fragment that remains after deletion (the post-deletion fragment); the test case set is a collection of multiple pieces of vulnerable source code;

S32, synthesizing an abstract syntax tree: building a complete abstract syntax tree from the post-deletion fragment obtained in step S31;

S33, performing a syntax check on the complete abstract syntax tree to determine whether it conforms to the syntax specification.

9. The grammar-aware fuzz testing method for code testing according to claim 8, wherein step S32 comprises the following steps:

S321, inputting the post-deletion fragment into the trained deep learning model to obtain candidate abstract syntax tree fragments;

S322, selecting a suitable abstract syntax tree fragment from the candidates and combining it with the post-deletion fragment to generate a complete abstract syntax tree; wherein a suitable abstract syntax tree fragment is one that satisfies both of the following conditions: A. it has the highest probability in the set predicted for the next abstract syntax tree fragment; B. it matches the type required by the current abstract syntax tree fragment;

S323, traversing the abstract syntax tree to find the position at which to add an abstract syntax tree fragment, until no more nodes can be added, thereby completing the synthesis of the abstract syntax tree; the position is found as follows: if the current node has no child nodes and is not a terminal node, the abstract syntax tree fragment is attached to the current node; otherwise, the search is invoked recursively on the child nodes of the current node, traversing in pre-order.

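10. A grammar-aware fuzz testing system for code testing, characterized in that it implements the grammar-aware fuzz testing method for code testing according to any one of claims 1 to 9 and comprises the following modules connected in sequence:

a data preprocessing module, configured to convert the collected vulnerable source code into a data type that the deep learning model can recognize, thereby parsing the vulnerable source code;

a model training module, configured to train the deep learning model with the preprocessed vulnerable source code as the training set;

a code generation and checking module, configured to generate new code segments with the trained deep learning model and then check the generated code segments for conformance to the language specification;

a fuzz testing module, configured to feed the generated new code segments to an interpreter or compiler for fuzz testing.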