CN111428247B

CN111428247B - Method for improving a computer vulnerability library

Info

Publication number: CN111428247B
Application number: CN202010326021.8A
Authority: CN
Inventors: 耿思嘉; 茅兵
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2020-04-23
Filing date: 2020-04-23
Publication date: 2023-04-04
Anticipated expiration: 2040-04-23
Also published as: CN111428247A

Abstract

The invention provides a method for improving a computer vulnerability library, comprising the following steps: (1) instrumenting the sample vulnerability programs with the GCOV compile option of the GCC compiler and reproducing the vulnerabilities on the instrumented programs to obtain their execution paths; (2) using the Cloc and Lizard tools to obtain the number of lines of code and the cyclomatic complexity of the vulnerability programs; (3) comparing, from the data obtained, the differences and connections between the computer vulnerability library LAVA-M and the sample vulnerability programs; (4) improving the existing computer vulnerability library (LAVA-M) in terms of both control flow and data flow so that it more closely resembles the samples. The invention carries out a large-scale comparative study of the vulnerabilities in the computer vulnerability library and in the samples and remedies the shortcomings of the existing computer vulnerability library, so that, as a benchmark, it can evaluate the performance of vulnerability detection tools more comprehensively.

Description

Method for improving a computer vulnerability library

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a method for improving a computer vulnerability library.

Background

To detect vulnerabilities effectively, researchers have in recent years proposed a variety of vulnerability detection techniques, which fall into two categories: dynamic analysis and static analysis. Static analysis does not execute the target program, so its results often contain false positives; dynamic analysis actually executes the target program, so its results are more accurate, which gives dynamic analysis wider applicability in vulnerability detection. Among dynamic techniques, fuzz testing (fuzzing) has been widely studied and applied in both industry and academia.

As a software testing technique, fuzz testing can be classified into black-box, gray-box, and white-box fuzzing. Black-box fuzzing has no knowledge of the internal mechanisms of the target program and generates new test inputs randomly and blindly, so its effectiveness is low. White-box fuzzing is more effective, but at the expense of scalability, so it cannot handle large programs. Gray-box fuzzing has received increasing attention in recent years because it balances the two attributes to achieve both effectiveness and scalability; the most widely used gray-box fuzzer at present is American Fuzzy Lop (AFL).

As fuzzing tools have matured, the common way to test a fuzzer is to judge its performance by observing its results on a set of benchmark programs. Researchers have proposed various benchmark programs in different works, which generally fall into two categories: real programs containing real vulnerabilities, and artificial programs containing artificial vulnerabilities. The vulnerabilities contained in real programs have several deficiencies when used to measure the effectiveness of vulnerability detection tools: they are not controllable; as the performance of vulnerability detection tools improves, the value of any single benchmark program decreases over time; the vulnerabilities contained in real programs are more scattered; and so on.

Therefore, to measure the performance and the limitations of vulnerability detection tools, researchers have been developing artificial benchmarks (computer vulnerability libraries) that contain software vulnerabilities. Such libraries are built either by injecting bugs into existing programs (LAVA-M, Rode0day) or from carefully designed, hand-written vulnerable programs (CGC). Compared with real vulnerable programs, a computer vulnerability library has the following advantages: it provides a number of vulnerabilities that are known to exist (ground truth), and it can synthesize a diversity of vulnerability types.

Despite these advantages, computer vulnerability libraries also face a problem. Benchmarks, whether computer vulnerability libraries or sample vulnerabilities, are used to measure the effectiveness of vulnerability detection tools, yet some tools that work well on computer vulnerability libraries may not perform well at detecting real vulnerabilities. For example, Wang found that on LAVA-M (a computer vulnerability library), Angora (a fuzzing tool refined from AFL) can detect almost all of the vulnerabilities, whereas AFL can detect only some of them. On the sample vulnerabilities, however, over a total of 8 trials of 24 hours each, AFL found 62% more vulnerabilities than Angora (e.g. CVE-2017-6966, CVE-2018-11416, CVE-2017-13741). Researchers are therefore hesitant to evaluate the effectiveness of vulnerability detection tools using computer vulnerability libraries. However, existing work lacks a comparison and analysis of whether computer vulnerability libraries sufficiently represent real vulnerabilities. If the differences and connections between the two are not well understood, blindly using a computer vulnerability library to assess vulnerability detection tools is unreliable and can even yield misleading results.

Disclosure of Invention

Purpose of the invention: the technical problem to be solved is to provide, in view of the deficiencies of the prior art, a method for improving a computer vulnerability library, solving the problem that existing computer vulnerability libraries do not match the characteristics of sample vulnerability programs and therefore cannot measure the performance of vulnerability detection tools well when used as benchmarks. The invention specifically comprises the following steps:

step 1, selecting memory error vulnerabilities in the National Vulnerability Database (NVD) as sample vulnerabilities, the sample vulnerabilities forming a sample vulnerability data set; the target vulnerability library to be improved contains artificially constructed vulnerabilities, for example bugs inserted into real programs, which do not exist in reality and are highly artificial; acquiring the source code of each sample vulnerability program and the input that triggers the vulnerability, installing the operating system and software of the corresponding versions, recompiling the sample vulnerability program with dynamic instrumentation to obtain the code execution coverage and the code actually executed, and extracting the characteristics of the sample vulnerability, namely the control flow constraint characteristics and data flow constraint characteristics on the program execution path;

step 2, applying the Cloc and Lizard tools to the sample vulnerability programs and to the computer vulnerability library LAVA-M, eliminating useless information and extracting the characteristics of the programs, namely their number of lines of code and cyclomatic complexity;

step 3, comparing the differences and connections between the target vulnerability library and the sample vulnerability programs based on the characteristics of the sample vulnerabilities and the characteristics of the programs, the target vulnerability library being the computer vulnerability library LAVA-M;

step 4, improving the computer vulnerability library LAVA-M according to the data flow and control flow characteristics of the sample vulnerabilities obtained in step 1.

Further:

in step 1, from the public vulnerability database NVD, covering 2004 to 2018 in three-year stages, nine types of memory error vulnerability are selected evenly as sample vulnerabilities: stack overflow, heap overflow, integer overflow, null pointer dereference, format string, use-after-free, invalid access, invalid free, and uninitialized memory; the sample vulnerabilities form a sample vulnerability data set in which the proportion of vulnerability types is consistent with the overall proportion disclosed by CVE;

the sample vulnerability data set is processed as follows: reports and information sources related to each sample vulnerability are collected from the content on the CVE website and the external links it gives, such as ExploitDB; the target version of the vulnerable software is determined, the corresponding source code (or binary code) is found, and an operating system of the corresponding version is configured for it, the default being a Linux system, with CentOS used in a few cases; the vulnerable software is compiled and installed according to the collected reports or to the compile and configuration options described in the software documentation, and during compilation it is dynamically instrumented with the GCOV compile option of GCC to obtain the program coverage and the code lines actually executed; because the reports often omit necessary dependent libraries, which causes compilation to fail, the dependent libraries required for a correct installation must be installed manually; the proof of crash (PoC) provided in the vulnerability report is used as input to trigger the vulnerability and verify its type and error information; the control flow constraint characteristics and data flow constraint characteristics on the program execution path are then extracted; the control flow constraint characteristic refers to the type of the magic byte in a conditional constraint, and the data flow constraint characteristic refers to whether a data transformation is applied to the input on the program execution path.
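As an illustration of how the executed code lines and the coverage might be read back after the PoC run, the following is a minimal sketch that parses gcov's plain-text report format (`count:line:source`, with `-` for non-executable lines and `#####` for lines never executed). The function names and the `SAMPLE` report are hypothetical and hand-written for this example; the text does not specify how the coverage data was post-processed.

```python
import re

# One line of gcov's text output has the shape "<count>:<lineno>:<source>".
GCOV_LINE = re.compile(r"^\s*([^:]+):\s*(\d+):(.*)$")

def executed_lines(gcov_text):
    """Return {line_number: execution_count} for lines that ran at least once."""
    executed = {}
    for raw in gcov_text.splitlines():
        m = GCOV_LINE.match(raw)
        if not m:
            continue
        count, lineno = m.group(1).strip(), int(m.group(2))
        # Line 0 carries metadata; "-" is non-executable; "#####"/"=====" never ran.
        if lineno == 0 or count in ("-", "#####", "====="):
            continue
        executed[lineno] = int(count.rstrip("*"))
    return executed

def line_coverage(gcov_text):
    """Fraction of executable lines that ran at least once."""
    executable = 0
    for raw in gcov_text.splitlines():
        m = GCOV_LINE.match(raw)
        if m and int(m.group(2)) != 0 and m.group(1).strip() != "-":
            executable += 1
    return len(executed_lines(gcov_text)) / executable if executable else 0.0

# Hypothetical report, written by hand in gcov's format for illustration only.
SAMPLE = """\
        -:    0:Source:demo.c
        -:    1:#include <string.h>
        1:    2:int check(const char *s) {
        1:    3:    if (strcmp(s, "MAGIC") == 0)
    #####:    4:        return 1;
        1:    5:    return 0;
        -:    6:}
"""

print(executed_lines(SAMPLE))
print(line_coverage(SAMPLE))
```

The executed line numbers returned here are the ones whose conditional statements would then be inspected for magic bytes.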

In step 1, for the control flow constraint characteristics, the code actually executed on the program path, obtained by recompiling with GCOV, is used to record the type information and signedness information of the magic byte in each conditional constraint; the type information comprises char, string, float, int16, int32 and int64, and the signedness information comprises signed and unsigned; the different types are recorded separately, and the percentage of each type in the corresponding program is obtained;

for the data flow constraint characteristics, the binary code corresponding to the source code is converted into VEX intermediate code by the binary analysis tool Valgrind; during the conversion, the data type conversions that occur on the propagation path from all bytes of the vulnerability-triggering input given in the vulnerability report of step 1 to the final crash point are recorded; the types of arithmetic operation applied to the input are determined from the extracted information on instructions that write to the content and instructions that perform arithmetic on it; the proportion of add, subtract, multiply, divide and shift instructions among all instructions is then recorded; that is, for the sample vulnerability data set and for the computer vulnerability library data set, the percentage of vulnerabilities in the corresponding data set whose input undergoes data transformation is counted and recorded as N.
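The computation of N can be sketched as follows, assuming the operations on each input-to-crash path have already been extracted as a list of VEX IR operation names (for example `Iop_Add32` or `Iop_Shl32`). The extraction step itself, the dictionary layout, and the vulnerability names are assumptions made for illustration and are not described in the text.

```python
# Prefixes of VEX IR operation names (after "Iop_") that count as data
# transformations: add, subtract, multiply, divide and shift.
TRANSFORM_KINDS = {
    "Add": "add", "Sub": "sub", "Mul": "mul", "Div": "div",
    "Shl": "shift", "Shr": "shift", "Sar": "shift",
}

def transform_kinds(ops):
    """Map op names such as 'Iop_Add32' to the set of transformation kinds seen."""
    kinds = set()
    for op in ops:
        name = op[4:] if op.startswith("Iop_") else op
        for prefix, kind in TRANSFORM_KINDS.items():
            if name.startswith(prefix):
                kinds.add(kind)
    return kinds

def percent_transformed(paths):
    """N: percentage of vulnerabilities whose input is transformed on its path."""
    if not paths:
        return 0.0
    hit = sum(1 for ops in paths.values() if transform_kinds(ops))
    return 100.0 * hit / len(paths)

# Hypothetical per-vulnerability operation lists, for illustration only.
PATHS = {
    "vuln_a": ["Iop_Add32", "Iop_CmpEQ32"],
    "vuln_b": ["Iop_CmpEQ8"],
    "vuln_c": ["Iop_Shl32", "Iop_DivU32"],
    "vuln_d": [],
}

print(percent_transformed(PATHS))
```

Running the same function over the sample data set and over the computer vulnerability library data set gives the two values of N that are compared in step 3.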

An automated script collects the type of the magic byte in each conditional constraint (mainly the conditional statements if and switch and the loop statements while and for). The magic byte type in each conditional constraint is extracted as follows: let the array str be the user's input and str[i] be a byte of the input; if the program execution path contains a statement of the form if (strcmp(str, magic_byte) == 0) or if (str[i] == magic_byte), or a code segment of the following form:

switch (str[i]) {
case magic_byte_1: statement 1
case magic_byte_2: statement 2
…
}

then the magic byte type in the statement is collected. This gives the magic byte types in the samples and the percentage of each type in the corresponding program, and the differences between the magic bytes of the sample vulnerability programs and those of the computer vulnerability library are compared. For the data flow constraint characteristics, the form of data transformation on the path from the input to the program crash point is extracted by static taint analysis, that is, any operation of the form str = str + a, str = str - a, str = str × a, str = str / a, a shift (<< or >>), or a hash operation on that path is extracted;

the data transformation characteristics on the execution paths of the computer vulnerability library and of the real vulnerability programs are then compared, and for the sample vulnerability data set and the computer vulnerability library data set respectively, the percentage of vulnerabilities in the data set whose input undergoes data transformation is counted and recorded as N.
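A minimal sketch of the magic byte collection, assuming the executed conditional statements are available as source-text strings. The regular expressions and function names are hypothetical simplifications: they only separate char, string and integer literals, whereas distinguishing int16/int32/int64 and signedness would additionally need the declared type of the compared variable, which this sketch does not attempt.

```python
import re
from collections import Counter

# Simplified patterns for the three statement shapes named in the text.
STRCMP = re.compile(r'strcmp\s*\(\s*\w+\s*,\s*"[^"]*"\s*\)')
CHAR_CMP = re.compile(r"(?:==\s*|case\s+)'(?:\\.|[^'\\])'")
INT_CMP = re.compile(r"(?:==\s*|case\s+)(?:0[xX][0-9a-fA-F]+|\d+)\b")

def classify(statement):
    """Return the magic byte type seen in one conditional statement, or None."""
    if STRCMP.search(statement):      # checked first: "== 0" after strcmp is not a magic byte
        return "string"
    if CHAR_CMP.search(statement):
        return "char"
    if INT_CMP.search(statement):
        return "int"
    return None

def type_percentages(statements):
    """Percentage of each magic byte type among the classified statements."""
    counts = Counter(t for t in map(classify, statements) if t)
    total = sum(counts.values())
    return {t: 100.0 * n / total for t, n in counts.items()}

# Hypothetical executed statements, for illustration only.
STATEMENTS = [
    'if (strcmp(str, "MAGIC") == 0)',
    "if (str[i] == 'A')",
    "case 'B':",
    "if (hdr == 0x464c457f)",
    "x = y + 1;",
]

print(type_percentages(STATEMENTS))
```

Comparing the resulting distribution for a sample vulnerability program with that of a computer vulnerability library program exposes the difference in magic byte types described above.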

In step 1, all state features obtained on the program execution path form a set G. All extracted control flow constraint features are denoted C and all data flow constraint features are denoted D; the set of magic byte types and percentages in the control flow constraint features is denoted mb, and the data transformation forms and percentages in the data flow constraints are denoted dt. The state feature set is written G = (C, D). Taking the program start point as the starting point, each statement on the execution path is denoted s, and then C = mb @ {s}, D = dt @ {s}, until the program crashes.

In step 2, to compare the sample vulnerability programs with the programs in the existing computer vulnerability library, the characteristics of the programs containing the vulnerabilities must be compared in addition to the characteristics of the vulnerabilities themselves. The Cloc and Lizard tools are used to obtain the characteristics of the sample vulnerability programs and of the computer vulnerability library. When these characteristics are obtained, useless information such as blank lines, comments, brackets, and declarations of members, types and namespaces is removed; the average number of lines of code and the median cyclomatic complexity are computed, box plots are drawn, and the differences and connections between the sample vulnerability programs and the computer vulnerability library are compared and analyzed.

Specifically, the following operations are performed on the sample vulnerability programs and on the programs of the computer vulnerability library respectively: the number of lines of code of each program is obtained with the Cloc tool, the cyclomatic complexity of each program is obtained with the Lizard tool, and the line counts are averaged;

the average number of lines of code of the sample vulnerability programs is recorded as L1, and the median cyclomatic complexity of the sample vulnerability programs is recorded as CCN1;

the average number of lines of code of the programs in the computer vulnerability library is recorded as L2, and the median cyclomatic complexity of the programs in the computer vulnerability library is recorded as CCN2.
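The two summary statistics can be sketched with the Python standard library. The per-program values below are hypothetical placeholders, not measurements from the data sets; only the choice of mean for lines of code and median for cyclomatic complexity comes from the text.

```python
from statistics import mean, median

def program_features(loc_per_program, ccn_per_program):
    """Return (average lines of code, median cyclomatic complexity)."""
    return mean(loc_per_program), median(ccn_per_program)

# Hypothetical per-program values, for illustration only.
sample_loc = [100, 200, 600]
sample_ccn = [3.0, 9.0, 5.0]

L1, CCN1 = program_features(sample_loc, sample_ccn)
print(L1, CCN1)
```

Applying the same function to the programs of the computer vulnerability library yields L2 and CCN2.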

The LAVA-M data set of the computer vulnerability library comprises four software programs: base64, uniq, who and md5sum. The sample vulnerability programs comprise 80 CVEs in 40 software programs. These include libraries that provide specific functions, such as libYAML; software with server functions, such as ProFTPD; and programming languages, such as PHP. The number of lines of code and the cyclomatic complexity of these programs are counted. Generally speaking, the fewer the lines of code, the smaller the program, the shallower a hidden vulnerability is likely to be, and the more easily it is detected by vulnerability detection tools. High cyclomatic complexity indicates that the decision logic of the program code is complex, and the likelihood of program errors is closely related to high cyclomatic complexity.

Step 3 comprises the following: based on the characteristics of the sample vulnerabilities obtained in step 1, the sample vulnerabilities are compared with the computer vulnerability library with respect to the control flow constraint characteristics and the data flow constraint characteristics, including the type and percentage of magic bytes and the form of data transformation (arithmetic operation, shift operation, hash operation);

based on the characteristics of the sample vulnerability programs obtained in step 2, the median cyclomatic complexity and the average number of lines of code are compared to identify the differences between the sample vulnerability programs and the computer vulnerability library. The average number of lines of code for the computer vulnerability library LAVA-M is 3093, while for the sample vulnerability programs it is 178971, about 58 times that of LAVA-M. The number of lines of code of the programs in the artificial vulnerability library is thus far smaller than that of real-world programs. The cyclomatic complexity of LAVA-M is 8.6 and that of the sample vulnerability programs is 8.1, so the program characteristics of the computer vulnerability library do not match those of the sample vulnerability programs. In addition, the characteristics of a program also affect, to some extent, the vulnerabilities it contains. Experience shows that in a program with fewer lines of code a vulnerability tends to be hidden at a shallower depth than in a larger program, and shallow vulnerabilities are more easily detected by vulnerability detection tools; such a program therefore cannot properly measure whether a tool can detect deeply hidden vulnerabilities. Consequently, the computer vulnerability library may not be able to measure the performance of vulnerability detection tools fairly and impartially when used as a benchmark test set.
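The size gap quoted above can be checked with a short calculation. The four input figures are taken from the text; the variable names are only illustrative.

```python
LAVA_M_AVG_LOC = 3093      # average lines of code, LAVA-M (from the text)
SAMPLE_AVG_LOC = 178971    # average lines of code, sample programs (from the text)
LAVA_M_CCN = 8.6           # cyclomatic complexity, LAVA-M (from the text)
SAMPLE_CCN = 8.1           # cyclomatic complexity, sample programs (from the text)

loc_ratio = SAMPLE_AVG_LOC / LAVA_M_AVG_LOC
ccn_gap = abs(LAVA_M_CCN - SAMPLE_CCN) / SAMPLE_CCN

print(f"lines-of-code ratio: {loc_ratio:.1f}x")
print(f"relative cyclomatic complexity gap: {ccn_gap:.1%}")
```

The ratio rounds to 58, matching the figure stated in the description; the complexity values differ by a few percent of the sample value.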

In step 4, according to the characteristics of the sample vulnerabilities obtained in step 1 and the characteristics of the sample vulnerability programs obtained in step 2, the existing computer vulnerability library (i.e. the LAVA-M data set) is improved in terms of both control flow constraint characteristics and data flow constraint characteristics. The specific improvement method is as follows:

step 4-1, increasing the difficulty of the conditional checks in the control flow constraints on the program execution path, that is, changing the type of the magic byte in the conditional constraints of the computer vulnerability library from the original 32-bit type to the magic byte types found in the samples in step 1, with the proportion of each type matching the percentages in the samples; in other words, the magic bytes, originally all of type int32, are changed to a mix of char (60%), string (30%) and int32 (10%);

step 4-2, adding data flow characteristics on the program execution path, that is, changing from the original form, in which the input undergoes no data transformation, to a form with data transformation, namely ((((str ^ a) & b) << c) / d), with the percentage of vulnerabilities that undergo data transformation matching the percentage N obtained in step 1;

step 4-3, combining the methods of step 4-1 and step 4-2 to improve the data flow and control flow characteristics together; finally, running the vulnerability detection tool AFL on the improved vulnerability library and on the original vulnerability library programs, and comparing the effect of the improvement by observing the number of vulnerabilities detected, the program coverage, and the time taken to discover vulnerabilities before and after the improvement;

step 4-4, replacing the programs of the existing computer vulnerability library with programs that are closer to the sample vulnerability programs in number of lines of code and cyclomatic complexity, such that the error between the average number of lines of code of the replacement programs and L1 is no more than 1%, and the error between the median cyclomatic complexity of the replacement programs and CCN1 is no more than 1%.
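Steps 4-1, 4-2 and 4-4 can be illustrated with the following sketch. It makes several assumptions that the text does not state: the transformed value is treated as a 32-bit unsigned word, the division is integer division, the constants a, b, c, d are arbitrary, the type counts are rounded by the largest-remainder rule, and all function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # assumption: values wrap like a 32-bit unsigned word

SHARES = {"char": 60, "string": 30, "int32": 10}  # percentages from step 4-1

def allocate_types(total, shares):
    """Step 4-1: split `total` injected bugs across magic byte types by percentage."""
    exact = {t: total * p / 100.0 for t, p in shares.items()}
    counts = {t: int(v) for t, v in exact.items()}
    leftover = total - sum(counts.values())
    # Hand out the remaining bugs to the types with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda t: exact[t] - counts[t], reverse=True)
    for t in by_remainder[:leftover]:
        counts[t] += 1
    return counts

def transform(x, a, b, c, d):
    """Step 4-2: apply ((((x ^ a) & b) << c) / d) before the magic byte comparison."""
    v = (x ^ a) & MASK32
    v &= b
    v = (v << c) & MASK32
    return v // d

def within_tolerance(value, target, tol=0.01):
    """Step 4-4: is `value` within a relative error of `tol` of `target`?"""
    return abs(value - target) <= tol * abs(target)

print(allocate_types(20, SHARES))
print(hex(transform(0x12345678, 0xFF, 0xFFFF, 4, 2)))
print(within_tolerance(180000, 178971), within_tolerance(8.6, 8.1))
```

With the transformation in place, a fuzzer must find an input whose transformed value equals the magic byte, rather than copying the magic byte into the input directly. The last line shows that a hypothetical replacement set averaging 180000 lines would satisfy the 1% bound against L1 = 178971, whereas LAVA-M's complexity of 8.6 does not satisfy it against CCN1 = 8.1.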

First, the sample vulnerability programs are instrumented with the GCOV compile option of the GCC compiler and the vulnerabilities are reproduced to obtain the execution paths of the programs; then the Cloc and Lizard tools are used to obtain the characteristics of the vulnerability programs; from the data obtained, the differences and connections between three mainstream computer vulnerability libraries (LAVA-M, CGC, Rode0day) and the sample vulnerability programs are compared; finally, the existing computer vulnerability library (LAVA-M) is improved in terms of both control flow and data flow so that it more closely resembles the samples, and an experimental evaluation is carried out with a vulnerability detection tool. The existing computer vulnerability library is thereby improved so that, as a benchmark, it can evaluate the performance of vulnerability detection tools more comprehensively.

By adopting the technical scheme, the invention has the following advantages:

1. Wide coverage: the invention comprehensively analyzes computer vulnerability libraries and real-world vulnerabilities through a large number of manual experiments; the data set covers 587 artificial vulnerabilities and 80 real vulnerabilities (CVEs).

2. Strong generality: the method analyzes memory error vulnerabilities, which are highly harmful and widely distributed in the real world, and the analysis method can be applied generally to vulnerabilities of other types.

3. Strong applicability: the improvement method of the invention is compatible with a variety of programs, does not depend on any particular programming language, and can be applied to most programs.

Drawings

The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.

FIG. 1 is a schematic diagram of different comparison features of a sample vulnerability and a computer vulnerability program in the present invention.

FIG. 2 is a schematic diagram of the design of the method of the present invention.

FIG. 3 is a flowchart of the vulnerability reproduction portion of FIG. 2.

Detailed Description

As shown in FIG. 1, FIG. 2 and FIG. 3, the present invention provides a method for improving a computer vulnerability library. First, the GCOV compile option of the GCC compiler is used to instrument and reproduce 80 vulnerability programs selected from the real world (i.e. the sample vulnerabilities) and obtain their execution paths; then the Cloc and Lizard tools are used to obtain the number of lines of code and the cyclomatic complexity of the vulnerability programs; from the data obtained, the differences and connections between three mainstream computer vulnerability libraries (LAVA-M, CGC, Rode0day) and the sample vulnerability programs are compared; finally, the existing computer vulnerability library (LAVA-M) is improved in terms of both control flow and data flow so that it is closer to the real world.

The first step: to collect the control flow constraints and data flow constraints on the execution path, it is necessary to understand in depth how the vulnerability program executes step by step from the input until the crash is triggered, so the dynamic execution trace of the vulnerability program on a given PoC must be obtained. To this end, an auxiliary vulnerability analysis framework based on the GCC compiler is proposed: recompiling a program with the gcov compile option reveals which source lines are actually executed on a given execution path, how many times each line is executed, and the execution time of each section of code. Vulnerability reproduction is then divided into four stages: report collection, environment configuration, software preparation, and vulnerability reproduction. Note that the computer vulnerability libraries LAVA-M, Rode0day and DARPA CGC already include the PoCs that trigger the vulnerabilities and the vulnerable software versions, so reproducing their vulnerabilities is simpler: only the corresponding experimental environment needs to be installed before reproduction. For sample vulnerabilities, the types and causes of the vulnerabilities and the commands required to trigger them all differ, so the process must be abstracted and summarized into a specific yet general experimental workflow, which is mainly aimed at reproducing sample vulnerabilities. After a vulnerability program has been compiled and the vulnerability reproduced, the program is debugged manually with GDB to analyze its execution path and thereby obtain the control flow and data flow constraints.

The second step: Cloc is installed on the Linux system. Then, according to the vulnerable software version mentioned in the vulnerability report, the software package of the corresponding version is downloaded locally, and cloc is run directly in that directory to obtain the number of lines of code of the program in the target folder. For the computer vulnerability libraries in the data set, the line counts of 4 programs in LAVA-M, 9 programs in Rode0day and 93 programs in DARPA CGC were collected. For the sample vulnerabilities, the line counts of 40 different programs were collected. Finally, the line counts within each data set are averaged to reduce error. The cyclomatic complexity of each program is then obtained with Lizard, which is installed with the command pip install lizard. As before, the software package of the corresponding version is downloaded locally according to the vulnerable software version mentioned in the vulnerability report; the lizard command is then run inside the target folder to obtain the complexity of all files in the directory.
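Collecting the line counts across many programs can be scripted. The sketch below assumes that cloc was run with its JSON output option on each program directory and that each report carries a `SUM` entry with a `code` field; the exact invocation used by the inventors is not given in the text, and the two reports shown are hand-written placeholders rather than real measurements.

```python
import json
from statistics import mean

def code_lines(cloc_json_text):
    """Return the total number of code lines from one cloc JSON report."""
    report = json.loads(cloc_json_text)
    return report["SUM"]["code"]

def average_code_lines(reports):
    """Average the code line counts over all programs of one data set."""
    return mean(code_lines(r) for r in reports)

# Hypothetical reports in the assumed cloc JSON layout, for illustration only.
REPORT_A = json.dumps({
    "header": {"n_files": 2, "n_lines": 150},
    "C": {"nFiles": 2, "blank": 20, "comment": 10, "code": 120},
    "SUM": {"blank": 20, "comment": 10, "code": 120, "nFiles": 2},
})
REPORT_B = json.dumps({
    "header": {"n_files": 1, "n_lines": 100},
    "C": {"nFiles": 1, "blank": 15, "comment": 5, "code": 80},
    "SUM": {"blank": 15, "comment": 5, "code": 80, "nFiles": 1},
})

print(average_code_lines([REPORT_A, REPORT_B]))
```

Reading only the `code` field excludes blank and comment lines, which corresponds to the removal of useless information described for step 2.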

The third step: the data obtained in the first and second steps show that the computer vulnerability libraries differ greatly from the sample vulnerabilities. In number of lines of code, the programs of the computer vulnerability libraries are much smaller than the programs of the sample vulnerabilities, by a factor of several hundred in the case of DARPA CGC. In cyclomatic complexity, LAVA-M and Rode0day are close to the sample vulnerability programs, but DARPA CGC consists of programs written entirely by hand by experts, and its cyclomatic complexity is much lower than that of the sample vulnerability programs, so its program logic is simpler. Vulnerabilities hidden at shallow depth may be more easily detected by vulnerability detection tools than those in programs with more lines of code and more complex logic, so such programs cannot properly measure whether a tool can detect deeply hidden vulnerabilities. As a benchmark test set, the computer vulnerability library therefore cannot measure the performance of vulnerability detection tools fairly and impartially. Regarding control flow, in the LAVA-M and Rode0day data sets the type and proportion of magic bytes in the conditional constraints on the program execution path differ greatly from those of the sample vulnerabilities, and the type and byte count of a magic byte affect how difficult it is for a vulnerability detection tool to solve the constraint; in this respect, LAVA-M and Rode0day cannot stand in for the sample vulnerability programs in a fair and impartial evaluation of a tool's ability to solve path constraints. DARPA CGC, by contrast, is written by experts, so the magic byte types in its programs can be designed more flexibly and look more realistic.
Regarding data flow, LAVA-M and Rode0day differ greatly from the sample vulnerabilities in the data transformations performed on the path; DARPA CGC does perform some data transformation operations on the path, but a gap with the sample vulnerability programs remains.

The fourth step: based on the characteristics of the sample vulnerabilities found in the third step, the method improves the computer vulnerability library in terms of both control flow and data flow to bring it closer to reality: 1) The artificial vulnerability data set is modified according to the type and proportion of magic bytes in the conditional constraints on the execution paths of the sample vulnerabilities. Specifically, the magic byte type in the conditional constraints of the computer vulnerability library is expanded from the original int32 type to the char, string and int32 types, with the proportion of each type roughly matching that in the sample vulnerabilities. 2) According to the data flow characteristics of the vulnerability triggering conditions on the execution paths of the sample vulnerabilities, data transformation operations, specifically the four basic arithmetic operations and shift operations, are applied to the input data on the vulnerability execution path of the computer vulnerability library, that is, before the input data is compared with the magic byte. 3) The first and second points are combined: the type and proportion of magic bytes in the conditional constraints on the vulnerability execution path are changed and data transformation operations on the input data are added at the same time.
After the computer vulnerability library has been improved in this way, three improved versions of the artificial vulnerability programs are obtained. The vulnerability detection tool AFL is run on each improved program and on the original program; each experiment is repeated 5 times, each run lasting 5 hours, to reduce the effect of fuzzing randomness, and the time to detection, the number of vulnerabilities detected, and how these change are observed. After all experiments, the number of vulnerabilities detected by AFL is collected to assess the effect of the improvement. The improved computer vulnerability library is found to differ greatly from the original version in the results it gives when evaluating the vulnerability detection tool.
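The aggregation of repeated runs might look like the sketch below. The counts are placeholders invented purely to exercise the code; they are NOT results reported by the patent and their direction carries no meaning. The variant names and function names are likewise hypothetical.

```python
from statistics import mean

def summarize(trials):
    """Average the per-run vulnerability counts for each benchmark variant."""
    return {variant: mean(counts) for variant, counts in trials.items()}

def change_vs_original(summary, baseline="original"):
    """Difference in mean detected vulnerabilities relative to the original library."""
    base = summary[baseline]
    return {v: m - base for v, m in summary.items() if v != baseline}

# Placeholder counts for five 5-hour AFL runs per variant; not real results.
TRIALS = {
    "original": [10, 12, 11, 10, 12],
    "control_flow": [8, 9, 8, 9, 8],
    "data_flow": [6, 5, 6, 6, 7],
    "combined": [3, 2, 3, 3, 4],
}

summary = summarize(TRIALS)
print(summary)
print(change_vs_original(summary))
```

Averaging over the five runs is what dampens the run-to-run randomness of fuzzing mentioned above; the same structure could hold time-to-discovery or coverage values instead of counts.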

In conclusion, the method effectively remedies the mismatch between the existing computer vulnerability library and the sample vulnerabilities, brings the library closer to the characteristics of the sample vulnerabilities, and, as a benchmark, allows the performance of vulnerability detection tools to be measured more fairly and impartially.

The present invention provides a method for improving a computer vulnerability library, and there are many ways to implement this technical solution. The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make a number of improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention. Any component not specified in this embodiment can be implemented using the prior art.

Claims

1. A method for improving a computer vulnerability library, comprising the steps of:

step 1, selecting memory error vulnerabilities from the National Vulnerability Database (NVD) as sample vulnerabilities, the sample vulnerabilities forming a sample vulnerability data set; obtaining the source code of the sample vulnerability programs and the inputs that trigger the vulnerabilities; installing the operating systems and software of the corresponding versions; recompiling the sample vulnerability programs with dynamic instrumentation to obtain the code coverage and the specific code executed; and extracting the characteristics related to the sample vulnerabilities, wherein the characteristics of the sample vulnerabilities refer to the control flow constraint characteristics and data flow constraint characteristics on the program execution paths;

step 2, applying the Cloc and Lizard tools to the sample vulnerability programs and the computer vulnerability library LAVA-M, eliminating useless information from the sample vulnerability programs and extracting the characteristics of the programs, wherein the characteristics of a program refer to its number of lines of code and its cyclomatic complexity;

step 4, improving the computer vulnerability library LAVA-M according to the data flow and control flow characteristics of the sample vulnerabilities obtained in step 1;

in step 1, memory error vulnerabilities are selected from known vulnerabilities as sample vulnerabilities, the sample vulnerabilities form a sample vulnerability data set, and the proportion of each vulnerability type in the sample vulnerability data set is consistent with the overall proportion published by CVE;

the sample vulnerability data set is processed as follows: reports and information sources related to the sample vulnerabilities are collected from the content on the CVE website and the external links it provides; the target version of the vulnerable software is determined, the corresponding source code is found, and an operating system of the corresponding version is configured for the vulnerable software, the default operating system being Linux; the vulnerable software is compiled and installed according to the collected reports or the description of compilation and configuration options given in the software documentation, and during compilation the vulnerable software is instrumented using the GCOV compile option of GCC to obtain the coverage of the program and the specific lines of code executed;

the vulnerability is triggered, and its type and error information are verified, using the crash-triggering proof-of-concept input provided in the vulnerability report;

the control flow constraint characteristics and data flow constraint characteristics on the program execution path are extracted; the control flow constraint characteristic refers to the type of the magic byte in a conditional constraint, and the data flow constraint characteristic refers to whether a data transformation is applied to the input on the program execution path;

in step 1, for the control flow constraint characteristics, the code actually executed on the program path, obtained by recompiling with GCOV, is used to record the type information and signedness information of the magic byte in each conditional constraint, wherein the type information comprises char, string, float, int16, int32 and int64, and the signedness information comprises signed and unsigned; the different kinds of type information and signedness information are recorded separately, and the percentage of each kind in the corresponding program is obtained;

for the data flow constraint characteristics, the binary code of the program is converted into VEX intermediate code by the binary analysis tool Valgrind; during the conversion, the data type conversions that occur on the propagation path from every byte of the vulnerability-triggering input in the vulnerability report of step 1 to the final crash point are recorded; the types of input-related arithmetic operations, comprising addition, subtraction, multiplication, division and shift, are determined from the extracted write instructions and the instructions that perform arithmetic operations on their contents; the proportion of the total number of addition, subtraction, multiplication, division and shift instructions among all instructions is then recorded; and, for the sample vulnerability data set and the computer vulnerability library data set respectively, the percentage of vulnerabilities whose input undergoes data transformation is counted and recorded as N;

in step 1, all the state features obtained on the program execution path form a set G; all the extracted control flow constraint features are recorded as C and all the data flow constraint features as D; the set of types and percentages of the magic bytes in the control flow constraint features is denoted by mb, and the set of data transformation forms and percentages in the data flow constraints is denoted by dt; the state feature set is expressed as G = (C, D); taking the program start point as the starting point and the program crash point as the end point, and marking each statement on the execution path as s, C = mb @ { s } and D = dt @ { s };

in step 2, the following operations are performed on the sample vulnerability programs and on the computer vulnerability library programs respectively: the number of lines of code of each program is obtained with the Cloc tool, the cyclomatic complexity of each program is obtained with the Lizard tool, and the average of the obtained line counts and the median of the obtained cyclomatic complexities are calculated;

the average number of lines of code of the sample vulnerability programs is recorded as L1 and the median of their cyclomatic complexity as CCN1; the average number of lines of code of the computer vulnerability library programs is recorded as L2 and the median of their cyclomatic complexity as CCN2;

step 3 comprises: on the basis of the sample vulnerability characteristics obtained in step 1, comparing the sample vulnerabilities with the computer vulnerability library in terms of the control flow constraint characteristics and the data flow constraint characteristics respectively, including the types and percentages of the magic bytes and the forms of data transformation;

for the program characteristics obtained in step 2, comparing the median cyclomatic complexity and the average number of lines of code of the programs, so as to identify the differences between the sample vulnerabilities and the computer vulnerability library;

in step 4, according to the sample vulnerability characteristics obtained in step 1 and the program characteristics obtained in step 2, the existing computer vulnerability library is improved in terms of both control flow constraint characteristics and data flow constraint characteristics, specifically as follows:

step 4-1, increasing the difficulty of the conditional checks of the control flow constraints on the program execution path: the type of the magic byte in the conditional constraints of the computer vulnerability library is changed from the original 32-bit type only to the magic byte types found in the samples in step 1, with the proportion of each type matching the percentage in the samples;

step 4-2, adding data flow characteristics on the program execution path: data transformations are added where the input originally underwent none, such that the percentage of inputs undergoing data transformation matches the percentage N obtained in step 1;

step 4-3, combining the methods of step 4-1 and step 4-2 to improve both the data flow and the control flow characteristics; finally, running a vulnerability detection tool on the improved vulnerability library and on the original vulnerability library programs respectively, observing the number of vulnerabilities detected, the time taken to find them and the program coverage before and after the improvement, and comparing the differences resulting from the improvement;

step 4-4, replacing the programs of the existing computer vulnerability library with programs whose number of lines of code and cyclomatic complexity are closer to those of the sample vulnerability programs, such that the error between the average number of lines of code of the replacement programs and L1 is not higher than 1% and the error between the median cyclomatic complexity of the replacement programs and CCN1 is not higher than 1%.
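The acceptance criterion of step 4-4 can be sketched in Python as follows. This is an illustration only: it assumes that "error" means relative error with respect to L1 and CCN1, which the claim does not state explicitly, and the function names are hypothetical. The inputs are the per-program line counts and cyclomatic complexities of the candidate replacement programs, as would be reported by Cloc and Lizard.

```python
from statistics import mean, median
from typing import Sequence


def relative_error(value: float, reference: float) -> float:
    """Relative deviation of value from a non-zero reference."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return abs(value - reference) / abs(reference)


def replacement_is_acceptable(loc_counts: Sequence[float],
                              ccn_values: Sequence[float],
                              l1: float,
                              ccn1: float,
                              tolerance: float = 0.01) -> bool:
    """Check the candidate programs against the sample statistics L1 and CCN1.

    The mean line count is compared with L1 and the median cyclomatic
    complexity with CCN1; both must lie within the tolerance (1% by default).
    """
    avg_loc = mean(loc_counts)
    med_ccn = median(ccn_values)
    return (relative_error(avg_loc, l1) <= tolerance
            and relative_error(med_ccn, ccn1) <= tolerance)
```

The mean is used for lines of code and the median for cyclomatic complexity because those are the statistics named in the claim.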