CN109597767B

CN109597767B - Genetic variation-based fuzzy test case generation method and system

Info

Publication number: CN109597767B
Application number: CN201811554639.9A
Authority: CN
Inventors: 卢凯; 周旭; 何兴陆; 张文喆; 王睿伯; 王鹏飞
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2018-12-19
Filing date: 2018-12-19
Publication date: 2021-11-12
Anticipated expiration: 2038-12-19
Also published as: CN109597767A

Abstract

The invention discloses a genetic variation-based fuzzy test case generation method and system. The method selects two seed test cases and, for each data position of the new test case, proceeds as follows: if the two seeds hold the same data at that position, the data is inherited to the current data position of the new test case; if the data differ and either of them belongs to a character string comparison set extracted by static analysis of the target binary file, the position is randomly mutated into data from that preset character string comparison set; otherwise, the data of one randomly selected seed is inherited to the current data position of the new test case. The invention combines the advantages of generation-based and mutation-based test case generation while avoiding their respective disadvantages, can fuzz the core code of target programs at scale without manual operation, and generates test cases that are more likely to improve path coverage and to trigger crashes.

Description

Genetic variation-based fuzzy test case generation method and system

Technical Field

The invention relates to vulnerability discovery in the computer field, and in particular to a fuzzy test case generation method and a fuzzy test case generation system based on genetic variation, which provide fuzz test cases for a target program that is the subject of vulnerability discovery.

Background

Test case generation methods fall roughly into two types: generation-based and mutation-based. In the existing generation-based method, test case generation rules are written manually so that test cases are generated according to the rules of the target; such test cases can bypass the error checking code of the target program, so that the core function code of the target program can be fuzzed (fuzzing being an automatic software testing technology based on defect injection). However, this method requires a large amount of manual intervention, which makes the labor cost excessive; moreover, different target programs follow different rules, so the generation-based method scales poorly and is not suitable for fuzzing a large number of different target programs. The existing mutation-based method generates new test cases by randomly mutating existing normal inputs, so that the generated test cases can use some of the information in those inputs to bypass error checking code; it runs directly without manual operation and scales well, since only the normal inputs need to be replaced for a different program. However, the test cases it generates can only fuzz code near the code reached by the normal test cases; code that is relatively distant, or that has harsh entry conditions, is difficult to reach, so the generated test cases can hardly fuzz all the code of the target program.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: in view of the above problems in the prior art, the invention provides a fuzzy test case generation method and a fuzzy test case generation system based on genetic variation, which combine the advantages of the generation-based and mutation-based test case generation methods while avoiding their respective disadvantages, can fuzz the core code of target programs at scale without manual operation, and generate test cases that are more likely to improve path coverage and to trigger crashes.

In order to solve the technical problems, the invention adopts the technical scheme that:

a fuzzy test case generation method based on genetic variation comprises the following implementation steps:

1) selecting two seed test cases;

2) selecting a data position as a current data position according to the seed length of the new test case;

3) judging whether the data of the two seed test cases at the current data position are the same, and if so, jumping to step 4); otherwise, jumping to step 5);

4) the data of the current data position of the seed test case is inherited to the current data position of the new test case;

5) judging whether any one of the data of the current data positions of the two seed test cases belongs to a preset character string comparison set, wherein the preset character string comparison set is obtained by performing static analysis on the target binary file that executes the test case and extracting the character string data in it; if so, randomly mutating the data of the current data position of the new test case into data in the preset character string comparison set, and jumping to step 7); otherwise, jumping to step 6);

6) randomly selecting data of the current data position of one seed test case to be inherited to the current data position of a new test case;

7) judging whether the whole seed length has been traversed; if not, selecting the next data position as the current data position and jumping to step 3); otherwise, proceeding to the next step;

8) carrying out random mutation on the new test case according to a specified proportion, and outputting the randomly mutated test case as the finally obtained new test case.
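Steps 1) to 8) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: each seed is assumed to be already split into a list of byte-string fields, so that one list index is one "data position"; the function name `generate_case`, the parameter `mutation_ratio`, and its default value are hypothetical; and traversal stops at the length of the shorter seed.

```python
import random
from typing import List, Optional, Set


def generate_case(seed_a: List[bytes], seed_b: List[bytes],
                  compare_set: Set[bytes], mutation_ratio: float = 0.05,
                  rng: Optional[random.Random] = None) -> List[bytes]:
    """Combine two seeds position by position (hypothetical sketch)."""
    rng = rng or random.Random()
    candidates = sorted(compare_set)  # stable order for a seeded rng
    child = []
    # Steps 2) and 7): walk every data position of the (shorter) seed.
    for a, b in zip(seed_a, seed_b):
        if a == b:
            # Steps 3) and 4): identical data is inherited unchanged.
            child.append(a)
        elif candidates and (a in compare_set or b in compare_set):
            # Step 5): either value is a known comparison string, so
            # replace it with a random member of the comparison set.
            child.append(rng.choice(candidates))
        else:
            # Step 6): inherit the data of one randomly chosen seed.
            child.append(rng.choice((a, b)))
    # Step 8): randomly mutate a specified proportion of positions.
    for i, field in enumerate(child):
        if rng.random() < mutation_ratio:
            size = max(1, len(field))
            child[i] = bytes(rng.randrange(256) for _ in range(size))
    return child
```

With `mutation_ratio=0.0` the final mutation pass is disabled, which makes it easy to observe the inheritance rules in isolation.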

Optionally, the character string comparison set preset in step 5) is obtained by performing static analysis on a target binary file for executing the test case and extracting character string data therein.

Optionally, the character string comparison set is a first subset PAC, a second subset PSC, and a third subset CSP which are obtained by dividing according to the density of the locations of the character string data in the target binary file for executing the test case, the density of the first subset PAC is higher than the densities of the second subset PSC and the third subset CSP, and the densities of the second subset PSC and the third subset CSP are the same.

Optionally, the detailed steps of step 5) include:

5.1) judging whether any one of the data of the current data positions of the two seed test cases belongs to the first subset PAC in the character string comparison set; if so, randomly mutating the data of the current data position of the new test case into data of the first subset PAC in the preset character string comparison set with a first probability, and otherwise into random data of the second subset PSC or the third subset CSP, and jumping to step 7); if not, jumping to step 5.2);

5.2) judging whether any one of the data of the current data positions of the two seed test cases belongs to the second subset PSC in the character string comparison set; if so, randomly mutating the data of the current data position of the new test case into data of the second subset PSC in the preset character string comparison set with a second probability, and otherwise into random data, and jumping to step 7); if not, jumping to step 5.3);

5.3) judging whether any one of the data of the current data positions of the two seed test cases belongs to the third subset CSP in the character string comparison set; if so, preferentially and randomly mutating the data of the current data position of the new test case into data of the third subset CSP in the preset character string comparison set with the second probability, and otherwise into random data, and jumping to step 7); if not, jumping to step 6).
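The tiered decision of steps 5.1) to 5.3) can be sketched as below. The sketch assumes that each subset has been flattened into a plain set of byte strings and that "otherwise" inside each step means "with the remaining probability"; the function name `tiered_mutation` and the default values of `p1` and `p2` are hypothetical, the text only requiring that the first probability exceed the second. A return value of `None` signals that step 6) should be applied.

```python
import random
from typing import Optional, Set


def tiered_mutation(a: bytes, b: bytes, pac: Set[bytes], psc: Set[bytes],
                    csp: Set[bytes], p1: float = 0.8, p2: float = 0.5,
                    rng: Optional[random.Random] = None) -> Optional[bytes]:
    """Pick replacement data for one differing position (hypothetical sketch)."""
    rng = rng or random.Random()

    def pick(pool):
        return rng.choice(sorted(pool))

    def random_data():
        return bytes(rng.randrange(256) for _ in range(max(1, len(a))))

    # Step 5.1): highest-priority subset PAC, first probability p1.
    if a in pac or b in pac:
        if rng.random() < p1:
            return pick(pac)
        lower = psc | csp
        return pick(lower) if lower else random_data()
    # Step 5.2): subset PSC, second probability p2.
    if a in psc or b in psc:
        return pick(psc) if rng.random() < p2 else random_data()
    # Step 5.3): subset CSP, second probability p2.
    if a in csp or b in csp:
        return pick(csp) if rng.random() < p2 else random_data()
    # Neither value is a comparison string: fall through to step 6).
    return None
```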

Optionally, the first probability is greater than the second probability.

Optionally, the extracting and generating step of the character string comparison set includes:

s1) extracting all character string data in the target binary file that executes the test case, recording the position of each character string, and dividing the character strings into different character string groups according to the density of their positions, wherein the set of character string groups is called P, each character string group is called p, and p ∈ P;

s2) acquiring the character string comparison information in the target binary file, recording the position of the comparison code in which each character string is used, and dividing the character strings into different character string groups according to the density of the comparison code positions, wherein the set of character string groups is called C, each character string group is called c, and c ∈ C;

s3) performing an AND (intersection) operation on the character string groups of the two sets to obtain the intersection of p and c, wherein the resulting set is called the first subset PAC, each character string group in it is called pac, pac ∈ PAC, and pac = p ∩ c;

s4) performing a difference operation on the character string groups of the two sets to obtain the difference set of p and c and the difference set of c and p, wherein the resulting sets are called the second subset PSC and the third subset CSP respectively, their character string groups are called psc and csp, psc ∈ PSC, csp ∈ CSP, psc = p - c, and csp = c - p;

s5) removing from the second subset PSC and the third subset CSP every character string group that has only one element, and combining all of these individual character strings into a new fourth subset S, finally obtaining a character string comparison set composed of the first subset PAC, the second subset PSC, the third subset CSP, and the fourth subset S.
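Steps s1) to s5) can be illustrated with ordinary set operations. The sketch below makes several assumptions that the text does not fix: "density" is approximated by clustering strings whose positions lie within a hypothetical `max_gap` of the previous string; the difference p - c is taken against all groups of C (and c - p against all groups of P); and the inputs are dictionaries mapping each string to one recorded position. All function and parameter names are illustrative.

```python
def group_by_density(positions, max_gap):
    """Cluster strings whose positions are close together (assumed rule)."""
    groups, current, last = [], [], None
    for text, pos in sorted(positions.items(), key=lambda kv: kv[1]):
        if last is not None and pos - last > max_gap:
            groups.append(frozenset(current))
            current = []
        current.append(text)
        last = pos
    if current:
        groups.append(frozenset(current))
    return groups


def build_compare_set(string_pos, compare_pos, max_gap=64):
    """Return (PAC, PSC, CSP, S) from two position maps (hypothetical sketch)."""
    P = group_by_density(string_pos, max_gap)    # s1) groups by string position
    C = group_by_density(compare_pos, max_gap)   # s2) groups by comparison code
    # s3) pac = p AND c, keeping only non-empty intersections.
    PAC = [p & c for p in P for c in C if p & c]
    # s4) psc = p - c and csp = c - p, taken against the whole other set.
    all_p = frozenset().union(*P)
    all_c = frozenset().union(*C)
    psc_raw = [p - all_c for p in P if p - all_c]
    csp_raw = [c - all_p for c in C if c - all_p]
    # s5) single-element groups are moved into the fourth subset S.
    S = frozenset(s for g in psc_raw + csp_raw if len(g) == 1 for s in g)
    PSC = [g for g in psc_raw if len(g) > 1]
    CSP = [g for g in csp_raw if len(g) > 1]
    return PAC, PSC, CSP, S
```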

Optionally, the step 1) of selecting two seed test cases specifically refers to selecting from a seed set, and the detailed step of generating the seed set includes:

1.1) collecting seed test cases used as training sets;

1.2) executing the target binary file on the seed test cases, and acquiring the path coverage information of each seed test case during execution;

1.3) randomly combining two seed test cases, whose path coverage information is PS1 and PS2 respectively, to generate a new combined test case, executing the target binary file on the new combined test case, and acquiring its path coverage information PN;

1.4) taking the path coverage information PS1 and PS2 of the two seed test cases as input, and the difference between the path coverage information PN of the corresponding new combined test case and PS1 and PS2 as the basis for classification, constructing a training set from the collected seed test cases and training a machine learning model; when the difference between the path coverage information PN of the new combined test case and PS1 and PS2 is greater than a threshold value, adding the new test case and its path coverage information to the seed set;

1.5) inputting any two seed test cases, with path coverage information PS1 and PS2, into the trained machine learning model for classification; selecting the two seed test cases that fall into the class in which the path coverage information of the new combined test case differs most from PS1 and PS2, and combining them into a new test case; executing the target binary file on the new test case and acquiring the path coverage information PN of the new combined test case; then judging whether the path coverage rate has increased within a specified time length, and if so, jumping to step 1.4); otherwise, judging that the generation of the seed set is finished and exiting.
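The labeling and thresholding part of step 1.4) can be sketched as follows. The text does not define how the "difference" between PN and PS1, PS2 is measured or which machine learning model is used; this sketch assumes that path coverage is a set of covered edge identifiers and that the difference is the number of edges in PN covered by neither parent. The names `coverage_gain` and `label_and_collect` are hypothetical, and model training itself is left out.

```python
def coverage_gain(ps1, ps2, pn):
    """Edges covered by the combined case but by neither parent (assumed metric)."""
    return len(set(pn) - (set(ps1) | set(ps2)))


def label_and_collect(runs, threshold):
    """Build training rows and pick cases to add to the seed set.

    `runs` is an iterable of (ps1, ps2, child, pn) tuples, where `child`
    is the combined test case and `pn` its path coverage.
    """
    training_rows, accepted = [], []
    for ps1, ps2, child, pn in runs:
        gain = coverage_gain(ps1, ps2, pn)
        # Input features are the parents' coverage; the label is the gain.
        training_rows.append(((frozenset(ps1), frozenset(ps2)), gain))
        if gain > threshold:
            # The combined case reached enough new paths: keep it as a seed.
            accepted.append((child, frozenset(pn)))
    return training_rows, accepted
```

The rows returned in `training_rows` are the kind of (input, label) pairs a classifier could be trained on in order to rank seed pairs in step 1.5).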

The invention also provides a genetic variation-based fuzzy test case generation system comprising computer equipment, wherein the computer equipment is programmed to execute the steps of the genetic variation-based fuzzy test case generation method.

Compared with the prior art, the invention has the following advantages:

1. The preset character string comparison set is obtained by statically analyzing the target binary file that executes the test case and extracting the character string data in it. Static analysis of the binary code of the target program yields part of the information in that code, including the information the program uses for character string comparison. This information is useful in vulnerability mining because it is often used in error checking, and purely random test case generation is unlikely to match it by chance. For example, in a browser program, the parsing module parses an html file according to the tags in the file, so the html file must be created strictly according to the character strings corresponding to the tags; otherwise the parsing module treats the created test case as erroneous text, and the core function modules of the browser program cannot be mined. Obtaining this information through static analysis is therefore of great significance for guiding the subsequent generation of test cases.

2. The method for processing the seed pair (two seed test cases) is the core content of the invention, and its core idea is derived from gene recombination in genetics. First, data that is the same at the same position in both seeds is inherited unchanged: as in genetics, genes shared by the father and the mother are more likely to be important genes and should be inherited rather than easily changed. This process simulates inheritance in genetics. Next, for positions where the data differ, it is judged whether the data belongs to the character string comparison class, and if so, the data at that position is replaced with a random member of the character string comparison class. This process simulates the targeted alteration of genes in genetic engineering. If the data does not belong to the character string comparison class, the data of one of the seeds is selected to be inherited. This process mimics the difference in expressed traits between dominant and recessive genes in genetics. Finally, the test case is randomly mutated to improve the code coverage of the target program. This process mimics gene mutation in genetics. Therefore, the fuzzy test case generation method based on genetic variation combines the advantages of the generation-based and mutation-based test case generation methods while avoiding their respective disadvantages, and can fuzz the core code of target programs at scale without manual operation.

Drawings

FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of the principle of genetic variation in the method of the embodiment of the present invention.

Detailed Description

As shown in fig. 1, the implementation steps of the fuzzy test case generation method based on genetic variation in this embodiment include:

1) selecting two seed test cases;

In this embodiment, the preset character string comparison set in step 5) is obtained by performing static analysis on the target binary file that executes the test case and extracting the character string data in it. Static analysis of the target binary file yields information about the file that assists the subsequent generation of test cases and addresses the question of what values data should be mutated into during mutation.
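As a rough illustration of the string extraction part of this static analysis, the sketch below scans a binary image for runs of printable ASCII characters and records the offset of each, in the manner of the Unix `strings` utility. It is an assumption that printable ASCII runs of at least `min_len` bytes are an adequate notion of "character string data"; the function name `extract_strings` is hypothetical, and locating the comparison code that uses each string would additionally require a disassembler, which is not shown.

```python
import re


def extract_strings(blob: bytes, min_len: int = 4) -> dict:
    """Map each printable ASCII string in `blob` to its first offset."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    found = {}
    for match in pattern.finditer(blob):
        # Keep the first occurrence so the recorded position is stable.
        found.setdefault(match.group(), match.start())
    return found


# Usage on an in-memory example; a real target would be read with
# open(path, "rb").read().
sample = b"\x00\x01<html>\x00ab\x00<body>\xff"
print(extract_strings(sample))
```

The resulting string-to-offset map is the kind of input from which position-based string groups can be formed.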

In this embodiment, the character string comparison set is a first subset PAC, a second subset PSC, and a third subset CSP which are obtained by dividing according to the density of the positions of the character string data in the target binary file for executing the test case, the density of the first subset PAC is higher than the densities of the second subset PSC and the third subset CSP, and the densities of the second subset PSC and the third subset CSP are the same.

In this embodiment, the detailed steps of step 5) include:

In this embodiment, the first probability is greater than the second probability.

In this embodiment, the step of extracting and generating the character string comparison set includes:

Through the above extraction and generation steps for the character string comparison set, four types of sets, PAC, PSC, CSP, and S, are obtained and assigned different priority levels: the first subset PAC is the first level, the second subset PSC and the third subset CSP are the second level, and the fourth subset S is the third level. With the sets classified in this way, operations of different priorities can be carried out according to level in the subsequent seed mutation when test cases are generated.

In this embodiment, selecting two seed test cases in step 1) specifically refers to selecting them from a seed set. A seed set with rich types makes it possible to obtain from the seeds a large amount of rule and format information related to the target program, and reasonable use of this rule information guides the test case generation method toward more effective test cases.

In this embodiment, the manner of generating the seed set is to perform iterative learning by using a machine learning model, and by means of machine learning, the rationality in seed selection is further enhanced, and the possibility that a new path is found by a newly generated seed can be further improved. In this embodiment, the detailed step of generating the seed set includes:

1.1) collecting seed test cases used as training sets;

The step of generating the seed set is implemented by a fuzzing tool.

Through the continuous reinforcement learning of the generated seed set, the machine learning capability is gradually enhanced along with the increase of the operation times, and the path coverage rate of the test case generated by the seeds selected by the machine learning model is greatly improved.

In this embodiment, when seed test cases used as the training set are collected in step 1.1), a large number of legal seeds should be collected, including: (1) common normal test cases; for a video processing program, for example, common videos, dramas, music videos (mv), and so on should be collected; (2) generated test cases; for a video processing program, for example, various videos should be generated with a video generation program. Furthermore, the seeds collected should be as varied as possible; for a video processing program, for example, files of various video formats should be included, such as mp4, rmvb, avi, wma, rm, mpeg, mov, mkv, flv, f4v, m4v, 3gp, dat, ts, mts, vob, etc.

As shown in fig. 2, part A represents the character string comparison set obtained by static analysis of the target program, which includes the first subset PAC, the second subset PSC, and the third subset CSP. For the two seed test cases, the first seed and the second seed, different operations are subsequently performed depending on whether the data at a given position is the same or different. In the figure, the regions of the first seed and the second seed that are not marked with a letter are positions where the seed data is the same, and the regions marked with letters are positions where the seed data differs.

The regions not marked with a letter in fig. 2 are positions where the data of the two seeds is the same, which indicates that the data at these positions is relatively stable, is more likely to be regular format data, and should not be easily changed. As in genetics, genes shared by the father and the mother are more likely to be important genes and should be inherited rather than easily changed. This process simulates inheritance in genetics. The regions marked with letters are positions where the data of the two seeds differs; for these, it is judged whether the data at the differing position belongs to the character string comparison class (shown in red in the figure). Different operations are then performed on the differing positions, according to the data type, when the new test case is generated.

The part marked with the letter B in fig. 2 represents the case where the data at the same position in the two seeds belongs to the character string comparison class. In this case, a character string can be randomly selected from the character string comparison class and placed in the newly generated test case. Because the data at such a position is mostly character string comparison data, randomly selecting within the set can lead the character string comparison operation to a new, unknown path, and this unknown path may lead to a core code region that the two seeds did not reach. This process simulates the directed modification of genes in genetic engineering.

The parts marked with the letters D and C in fig. 2 represent the case where the data at the same position in the two seeds does not belong to the character string comparison class. In this case, one of the two parts (yellow or green in the figure) is randomly selected for inheritance; this process simulates the difference in expressed traits between dominant and recessive genes in genetics. The generated test case is then randomly mutated according to a certain proportion, which simulates gene mutation in genetics. The result is the final new test case generated by the method, which can directionally improve the path coverage rate in the fuzz test.

Since the fuzzy test case generation method based on genetic variation is a genetic test case generation method, collecting a large number of legal seeds of many types matters: rich, high-quality seeds help the method inherit the excellent data of the various types of seeds more efficiently, which greatly helps the fuzzing tool improve code coverage and helps the test cases bypass the format checks in the early stage of the program so as to execute more of the core code of the program. In the fuzzy test case generation method based on genetic variation, the seeds are paired two by two, and different seeds have different execution paths, so the way in which seeds are selected is also very important.

In summary, the fuzzy test case generation method based on genetic variation of this embodiment borrows the genetic variation approach of genetics: data that two known test cases share at the same position is inherited, and positions where the two test cases differ undergo selective inheritance or random variation. In this way, the parts of the two test cases that are more efficient and more stable with respect to the fuzzed program are inherited by the next generation, and the parts that are relatively inefficient and changeable are mutated probabilistically, which greatly improves the effectiveness of the test cases. In addition, the present embodiment further provides a genetic variation-based fuzz test case generation system, which includes a computer device programmed to execute the steps of the genetic variation-based fuzz test case generation method according to the present embodiment.

The foregoing describes only the preferred embodiments of the invention and is not to be construed as limiting the invention in any way. Although the present invention has been described with reference to the preferred embodiments, it is not intended to be limited thereto. Those skilled in the art can, using the teachings disclosed above and without departing from the scope of the technical scheme of the invention, make many possible variations and modifications to it or modify it into equivalent embodiments. Therefore, any simple modification, equivalent change, or modification made to the above embodiments according to the technical spirit of the present invention, without departing from the content of the technical scheme of the present invention, shall fall within the protection scope of the technical scheme of the present invention.

Claims

1. A fuzzy test case generation method based on genetic variation is characterized by comprising the following implementation steps:

1) selecting two seed test cases;

5) judging whether any one of the data of the current data positions of the two seed test cases belongs to a preset character string comparison set, wherein the preset character string comparison set is obtained by performing static analysis on the target binary file that executes the test case and extracting the character string data in it; if so, randomly mutating the data of the current data position of the new test case into data in the preset character string comparison set, and jumping to step 7); otherwise, jumping to step 6); the character string comparison set comprises a first subset PAC, a second subset PSC, and a third subset CSP which are obtained by dividing according to the density of the positions of the character string data in the target binary file that executes the test case, wherein the density of the first subset PAC is higher than that of the second subset PSC and the third subset CSP, and the densities of the second subset PSC and the third subset CSP are the same;

2. The method according to claim 1, wherein the comparison set of strings preset in step 5) is obtained by performing static analysis on a target binary file for executing the test case to extract string data therein.

3. The genetic variation-based fuzzy test case generation method according to claim 1, wherein the detailed step of step 5) comprises:

4. The genetic variation-based fuzz test case generation method according to claim 3, wherein the first probability is greater than the second probability.

5. The genetic variation-based fuzzy test case generation method according to claim 1, wherein said step of extracting and generating said comparison set of character strings comprises:

6. The genetic variation-based fuzzy test case generation method according to claim 1, wherein the step 1) of selecting two seed test cases specifically means selecting from a seed set, and the detailed step of generating the seed set comprises:

1.1) collecting seed test cases used as training sets;

7. A genetic variation-based fuzz test case generation system, comprising a computer device, wherein the computer device is programmed to execute the steps of the genetic variation-based fuzz test case generation method according to any one of claims 1 to 6.