CN116383826A

CN116383826A - Binary vulnerability discovery process dynamic optimization method based on deep reinforcement learning

Info

Publication number: CN116383826A
Application number: CN202310302345.1A
Authority: CN
Inventors: 王栓奇; 刘钊; 武伟; 谢晚冬; 李之博; 王宇龙; 王梦阳; 盛珂
Original assignee: Information Central Of China North Industries Group Corp
Current assignee: Information Central Of China North Industries Group Corp
Priority date: 2023-03-27
Filing date: 2023-03-27
Publication date: 2023-07-04

Abstract

The invention discloses a method for dynamically optimizing the binary vulnerability discovery process based on deep reinforcement learning, comprising the following steps: an input sample of the binary program under fuzz testing is taken as the environment state and fed into a deep reinforcement learning model, the environment state representing the input sample as a byte array; the current sample state is converted into a mutated sample state by a mutation strategy, and the input sample state for the next time step is selected based on the execution path information of the binary program sample; a feedback reward is calculated for the mutated sample state based on a coverage metric; and whether the mutation strategy is effective is judged from the feedback reward, so that mutation strategy selection is optimized and the binary vulnerability discovery process is dynamically optimized. The invention improves the quality of samples generated by mutation, thereby improving the efficiency of binary code fuzz testing, effectively finding and exposing software vulnerabilities, and significantly improving software quality and security.

Description

Binary vulnerability discovery process dynamic optimization method based on deep reinforcement learning

Technical Field

The invention relates to the technical field of binary code security, in particular to a binary vulnerability-mining process dynamic optimization method based on deep reinforcement learning.

Background

As computer systems grow more capable, they also grow more complex and larger in scale; software has reached an unprecedented size, and more and more software security problems are being exposed. Software vulnerabilities cannot be ignored; they may stem from logic flaws that programmers inadvertently leave behind when writing applications. Any hidden defect or human error can cause significant losses to individuals and society, because attackers often exploit these program vulnerabilities to attack and damage systems.

Whether software has vulnerabilities is a determining factor in the security of an information system. Although vulnerabilities cannot be completely avoided over the whole life cycle of a software system, the security of computer software can be ensured to a great extent if vulnerabilities are detected and identified in time and security patches are then released. Accurately detecting and identifying software vulnerabilities, analyzing their risk level, cause, and means of exploitation, and releasing security patches in time is therefore an important activity in the computer software industry.

Vulnerability discovery is a common defense method that aims to detect as many of the vulnerabilities hidden in a program as possible, as quickly as possible, and to repair them in time. Software vulnerability discovery techniques can be broadly divided into two categories according to the object being analyzed: the first is source code vulnerability discovery, which targets software whose source code is available; the other is binary code vulnerability discovery, which targets closed-source software. To protect commercial interests and intellectual property, most software vendors do not release the source code of their products, so only binary programs can be obtained; binary code vulnerability discovery is therefore widely applicable and has significant practical value.

Compared with source code vulnerability detection, binary code vulnerability detection faces several problems: program information is hard to extract directly, assembly code is bulky and hard to analyze, the workload is heavy, and detection is difficult and time-consuming. With the development of artificial intelligence, machine learning, which trains a model on existing data and then uses the model to make predictions, has advanced considerably, and many researchers have applied it to vulnerability analysis and discovery. Compared with traditional methods, a vulnerability detection model based on machine learning can process large-scale data sets, which increases detection speed and reduces detection cost. Because it learns automatically, machine learning also reduces manual effort. At present, research on machine-learning-based software vulnerability discovery is still immature, and existing methods often have high false-positive and false-negative rates because standard data sets are lacking and effective feature sets cannot be extracted.

Therefore, it has great significance and prospect to study how to utilize machine learning to conduct software vulnerability discovery and improve the accuracy of vulnerability discovery.

Disclosure of Invention

In view of the above, the invention provides a method for dynamically optimizing the binary vulnerability discovery process based on deep reinforcement learning, which dynamically optimizes the mutation strategy and improves the efficiency of accurately locating binary code vulnerabilities.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

A method for dynamically optimizing the binary vulnerability discovery process based on deep reinforcement learning comprises the following steps:

taking an input sample of the binary program under fuzz testing as the environment state and feeding it into a deep reinforcement learning model, the environment state representing the input sample as a byte array;

converting the current sample state into a mutated sample state by a mutation strategy, and selecting the input sample state for the next time step based on the execution path information of the binary program sample;

calculating a feedback reward for the mutated sample state based on a coverage metric;

judging from the feedback reward whether the mutation strategy is effective, optimizing mutation strategy selection, and thereby dynamically optimizing the binary vulnerability discovery process.

Preferably, converting the current sample state into the mutated sample state by the mutation strategy specifically includes:

according to the current sample state $s_t$, selecting a mutation action $a_t$ from the mutation action space according to the policy function $\pi$, as shown in the following formula:

$$a_t = \pi(s_t)$$

applying the mutation action to the current input sample state $s_t$ to obtain the mutated sample state $s_t'$, as shown in the following formula:

$$s_t' = \mathrm{Mutate}(s_t, a_t)$$

wherein Mutate() is the mutation function.

Preferably, selecting the input sample state of the next time step based on the binary program sample execution path information specifically includes:

setting up a sample queue $Q_s$ and a set $P$ of all path information already executed by the existing samples;

if, at time step t, the current sample state is $s_t$, the mutated sample state is $s_t'$, and a new sample execution path $p_t$ is generated during execution, then the sample execution path $p_t$ is added to the queue $Q_s$, the set $P$ is updated, and $s_t'$ is used as the input sample state of the next time step; if no new sample execution path $p_t$ is generated during execution, a sample is randomly selected from the effective sample queue $Q_s$ as the input sample state of the next time step, as shown in the following formula:

$$s_{t+1} = \begin{cases} s_t', & p_t \notin P \\ \mathrm{random\_choose}(Q_s), & \text{otherwise} \end{cases}$$

where random_choose() is a random selection function, and $P$ represents the set of all path information already executed by the existing samples.

Preferably, calculating the feedback reward for the mutated sample state based on the coverage metric specifically includes:

calculating and recording the execution path information corresponding to the mutated sample state $s_t'$ to obtain the record set $M_t = \mathrm{Execute}(s_t')$;

wherein Execute() is the execution path recording function, $M_t$ is the record set of execution path information, and each element $m_{i,j}$ of $M_t$ represents the number of times the jump edge from basic block $b_i$ to basic block $b_j$ was executed;

judging whether $m_{i,j}$ is greater than 0; if $m_{i,j} > 0$, the jump edge has been executed at least once, and the subset of records satisfying this condition is denoted $M_t'$, as shown in the following formula:

$$M_t' = \{\, m_{i,j} \mid m_{i,j} > 0,\ m_{i,j} \in M_t \,\}$$

according to the record subset $M_t'$ and the record set $M_t$, calculating the proportion of the jump edges covered when the current sample is executed to all jump edges of the target program, which is used as the feedback reward of the current sample; the specific calculation formula is as follows:

$$R_t = \frac{\mathrm{size}(M_t')}{\mathrm{size}(M_t)}$$

wherein $R_t$ represents the proportion of the covered jump edges to all jump edges of the target program, and size() represents the number of elements in a set.

Compared with the prior art, the invention discloses a method for dynamically optimizing the binary vulnerability discovery process based on deep reinforcement learning. By improving the traditional fuzz testing process with a deep reinforcement learning model, it reduces the randomness and blindness of input sample mutation, increases the probability of generating effective samples, and improves the quality of samples generated by mutation, thereby improving the efficiency of binary code fuzz testing, effectively finding and exposing software vulnerabilities, and significantly improving software quality and security.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a dynamic optimization method of a binary vulnerability discovery process based on deep reinforcement learning.

FIG. 2 is a block diagram of a dynamic optimization method of a binary vulnerability discovery process based on deep reinforcement learning.

FIG. 3 is a diagram showing dynamic optimization of a sample mutation strategy according to the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The embodiment of the invention discloses a binary vulnerability discovery process dynamic optimization method based on deep reinforcement learning, which comprises the following steps as shown in fig. 1 and 2:

In this embodiment, an input sample can be characterized in several ways, including constructing the input set space from all substrings of the sample, or from the sample's bit-level data. In terms of mutation granularity, strings are too coarse and bits are too fine, and neither favors effective mutation of the data. The invention therefore uses a byte array to represent the state s corresponding to the input sample data D. To maximize the probability of finding a new path, a subset of the set of all elements of the binary program input sample is set as null, namely

Meanwhile, to make better use of the existing data mutation history, the system sets up an effective sample queue $Q_s$ according to that history, and the set of all path information already executed by the existing samples is denoted $P$. If at time step t the sample state is $s_t$, the mutated sample state is $s_t'$, and a new execution path $p_t$ is generated during execution, then it is added to the queue $Q_s$, the set $P$ is updated, and $s_t'$ is used as the state input for the next time step; otherwise a sample is randomly selected from the effective sample queue $Q_s$ as the state input for the next time step, as shown in the following equation:

$$s_{t+1} = \begin{cases} s_t', & p_t \notin P \\ \mathrm{random\_choose}(Q_s), & \text{otherwise} \end{cases}$$

where random_choose() is a random selection function.
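The selection rule described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `select_next_state`, the representation of a path as a frozen set of jump edges, and the choice to store mutated samples (rather than paths) in the queue are all assumptions made so that a later random draw from the queue yields a usable input sample.

```python
import random
from typing import FrozenSet, List, Set, Tuple

# Assumption: an execution path is identified by the set of jump edges it covers.
Path = FrozenSet[Tuple[int, int]]


def select_next_state(
    s_prime: bytes,
    p_t: Path,
    queue: List[bytes],
    seen_paths: Set[Path],
    rng: random.Random,
) -> bytes:
    """Return the input sample state for the next time step.

    If the mutated sample produced a path not yet in the set P, keep it:
    record the path, enqueue the sample, and continue from it. Otherwise
    fall back to a random sample from the effective sample queue Q_s.
    """
    if p_t not in seen_paths:
        seen_paths.add(p_t)      # update set P
        queue.append(s_prime)    # assumption: Q_s stores the sample itself
        return s_prime
    # Guard for an empty queue (hypothetical edge case): keep the mutated sample.
    return rng.choice(queue) if queue else s_prime


rng = random.Random(1)
queue = [b"seed"]
seen = {frozenset({(0, 1)})}
new_path = frozenset({(0, 1), (1, 2)})

nxt = select_next_state(b"seeD", new_path, queue, seen, rng)    # new path: kept
again = select_next_state(b"sEed", new_path, queue, seen, rng)  # known path: random draw
```

In the second call the path is already known, so the mutated sample is discarded and the next state is drawn from the queue.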

In this embodiment, the core of the fuzz testing process is to mutate the data of a sample to obtain a new sample capable of triggering an abnormal state in the target program. Balancing performance and efficiency, the method summarizes common data mutation methods into a mutation action space A, shown in the table below:

TABLE 1 mutation action space

Based on the current sample state $s_t$, the reinforcement learning model selects a mutation action $a_t$ from the mutation action space according to the policy function π(), as shown in the following formula:

$$a_t = \pi(s_t)$$

where π() is the policy function, $s_t$ is the current sample state, and $a_t$ is the mutation action.

According to the selected action, the current input data state $s_t$ is mutated so as to fully explore the environment state space and the mutation action space and to obtain the mutated sample state $s_t'$ with higher path coverage, as shown in the following formula:

$$s_t' = \mathrm{Mutate}(s_t, a_t)$$

where Mutate() is the mutation function.
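As an illustration of the two formulas, the following Python sketch applies a selected mutation action to a byte-array state. The four operators in `ACTION_SPACE` are hypothetical examples of common byte-level mutations and are not taken from Table 1; the `policy` function is a uniform-random placeholder standing in for the trained policy π().

```python
import random
from typing import Callable, Dict


def flip_bit(data: bytearray, rng: random.Random) -> bytearray:
    """Flip one randomly chosen bit."""
    out = bytearray(data)
    if out:
        out[rng.randrange(len(out))] ^= 1 << rng.randrange(8)
    return out


def replace_byte(data: bytearray, rng: random.Random) -> bytearray:
    """Overwrite one randomly chosen byte with a random value."""
    out = bytearray(data)
    if out:
        out[rng.randrange(len(out))] = rng.randrange(256)
    return out


def insert_byte(data: bytearray, rng: random.Random) -> bytearray:
    """Insert a random byte at a random position."""
    out = bytearray(data)
    out.insert(rng.randrange(len(out) + 1), rng.randrange(256))
    return out


def delete_byte(data: bytearray, rng: random.Random) -> bytearray:
    """Delete one byte, keeping at least one byte in the sample."""
    out = bytearray(data)
    if len(out) > 1:
        del out[rng.randrange(len(out))]
    return out


# Hypothetical mutation action space A (action id -> operator).
Operator = Callable[[bytearray, random.Random], bytearray]
ACTION_SPACE: Dict[int, Operator] = {
    0: flip_bit, 1: replace_byte, 2: insert_byte, 3: delete_byte,
}


def policy(state: bytearray, rng: random.Random) -> int:
    """Placeholder for a_t = pi(s_t); a trained model would use the state."""
    return rng.choice(sorted(ACTION_SPACE))


def mutate(state: bytearray, action: int, rng: random.Random) -> bytearray:
    """s_t' = Mutate(s_t, a_t): apply the chosen operator to a copy of the state."""
    return ACTION_SPACE[action](state, rng)


rng = random.Random(0)
s_t = bytearray(b"FUZZ")
a_t = policy(s_t, rng)
s_t_prime = mutate(s_t, a_t, rng)
```

Each operator works on a copy, so the current state `s_t` stays available if the mutated sample is later discarded.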

In this embodiment, it is noted that in conventional fuzz testing the test result is judged by whether a potential vulnerability of the program is triggered, usually observed as whether the target program under test enters an abnormal state such as a crash or a hang. However, triggering an exception in a binary program through fuzz testing is a long, time-consuming process; if this alone is used as the environment's feedback signal, the mutation strategy is hard to adjust in time, and time and resources are wasted on ineffective sample mutations.

To solve this problem, a coverage metric is used to measure the region of program execution covered by the current sample. Samples with higher coverage explore the code execution space of the target program more fully and are therefore more likely to trigger abnormal execution logic, that is, potentially dangerous vulnerabilities of the target program. Common coverage metrics include, for example, a dedicated coverage, line coverage, basic block coverage, branch coverage, condition coverage, and edge coverage. By preprocessing the target program, for example through instrumentation, the coverage of a sample can be obtained immediately after the target program finishes executing. This value changes noticeably over the course of continuous fuzz testing and gives timely feedback on the quality of the current sample, so that the scheduler can adjust subsequent mutation strategy selection accordingly.

Compared with other coverage metrics, edge coverage provides relatively more path information, so edge coverage is selected as the basis for calculating the feedback reward.

The execution path information corresponding to the mutated sample state $s_t'$ is calculated, recorded into shared memory, and denoted $M_t = \mathrm{Execute}(s_t')$. Each element $m_{i,j}$ in the record represents the number of times the jump edge from one basic block $b_i$ to another basic block $b_j$ was executed. If $m_{i,j} > 0$, the jump edge has been executed at least once, and the subset of records satisfying this condition is denoted $M_t'$, as shown in the following formula:

$$M_t' = \{\, m_{i,j} \mid m_{i,j} > 0,\ m_{i,j} \in M_t \,\}$$

wherein $m_{i,j}$ represents the number of times the jump edge from basic block $b_i$ to $b_j$ was executed.

The feedback reward is the proportion of the jump edges covered when the current sample is executed to all jump edges of the target program, calculated as follows:

$$R_t = \frac{\mathrm{size}(M_t')}{\mathrm{size}(M_t)}$$

wherein $R_t$ represents the proportion of the covered jump edges to all jump edges of the target program, and size() represents the number of elements in a set.
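The edge-coverage reward can be sketched in Python as below. The sketch assumes that the record set is held as a dictionary with one entry per jump edge of the target program, and that an execution is available as a trace of basic-block identifiers; the helper names `record_edges` and `edge_coverage_reward` are illustrative stand-ins for Execute() and the reward calculation, not names from the source.

```python
from typing import Dict, Iterable, Sequence, Tuple

Edge = Tuple[int, int]  # (source basic block id, destination basic block id)


def record_edges(trace: Sequence[int], all_edges: Iterable[Edge]) -> Dict[Edge, int]:
    """Build M_t: execution counts for every jump edge of the target program.

    Edges that were never taken keep a count of 0. An edge seen in the
    trace but missing from `all_edges` is added (assumption).
    """
    m_t: Dict[Edge, int] = {edge: 0 for edge in all_edges}
    for src, dst in zip(trace, trace[1:]):
        m_t[(src, dst)] = m_t.get((src, dst), 0) + 1
    return m_t


def edge_coverage_reward(m_t: Dict[Edge, int]) -> float:
    """R_t = size(M_t') / size(M_t), where M_t' keeps the entries with count > 0."""
    if not m_t:
        return 0.0
    covered = {edge: n for edge, n in m_t.items() if n > 0}
    return len(covered) / len(m_t)


# Hypothetical target with six jump edges and one recorded execution trace.
all_edges = [(0, 1), (1, 2), (1, 3), (3, 1), (3, 4), (2, 4)]
trace = [0, 1, 3, 1, 3, 4]

m_t = record_edges(trace, all_edges)
r_t = edge_coverage_reward(m_t)  # 4 of the 6 edges were executed at least once
```

Because the reward is available as soon as one execution finishes, it can be fed back at every time step instead of waiting for a crash or hang.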

In this embodiment, determining whether the mutation strategy is effective according to the feedback reward specifically includes: judging directly from the feedback reward; if the feedback reward is greater than a preset value, the mutation strategy is effective, otherwise it is ineffective.
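One way to picture how the threshold test can feed back into strategy selection is the following Python sketch. It uses a simple tabular, epsilon-greedy selector as a stand-in for the deep reinforcement learning model, which the source does not specify in detail; the class name `MutationSelector`, the running-average update, and the parameter values are assumptions for illustration only.

```python
import random


class MutationSelector:
    """Tabular stand-in for the learned policy over mutation actions."""

    def __init__(self, n_actions: int, threshold: float,
                 epsilon: float = 0.1, seed: int = 0) -> None:
        self.threshold = threshold          # the preset value for effectiveness
        self.epsilon = epsilon              # exploration probability
        self.values = [0.0] * n_actions     # running mean reward per action
        self.counts = [0] * n_actions
        self.rng = random.Random(seed)

    def is_effective(self, reward: float) -> bool:
        """A mutation strategy is effective if its reward exceeds the preset value."""
        return reward > self.threshold

    def choose(self) -> int:
        """Explore with probability epsilon, otherwise pick the best-valued action."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, action: int, reward: float) -> bool:
        """Fold the reward into the action's running mean; report effectiveness."""
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]
        return self.is_effective(reward)


selector = MutationSelector(n_actions=3, threshold=0.4, epsilon=0.0)
eff0 = selector.update(0, 0.2)   # below the preset value: ineffective
eff1 = selector.update(1, 0.6)   # above the preset value: effective
best = selector.choose()         # greedy choice, since epsilon is 0
```

With exploration switched off, the selector prefers the action whose rewards have been highest so far, which is the behaviour the threshold judgement is meant to encourage.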

In this embodiment, as shown in fig. 3, the invention proposes a method for dynamically optimizing the sample mutation strategy according to the correlation characteristics of the sample mutation process. To address the problem that the reward function considers only the branch detection information obtained after executing the current mutation strategy, all historical execution path information of the policy function π() that selects mutation actions for the sample state $s_t$ during input sample mutation is introduced and counted, giving the number of times new branches were triggered. For the distribution of the mutation strategies that trigger new branches, an average new-branch trigger update rate is defined

which measures how the new branches triggered by the mutation strategy are distributed over the whole mutation process. On this basis, a reward function is formed from the total number of new branches triggered by the mutation strategy and the average new-branch update rate

All historical execution path information is thus used to adjust the reward feedback dynamically and precisely, which realizes dynamic optimization of the sample mutation strategy and improves sample mutation efficiency throughout the defect discovery process.

(1) Reward based on the total number of new branches triggered by the mutation strategy

During continuous mutation of the sample state, a mutation strategy that triggers more new branches has a stronger ability to optimize the input sample and should be rewarded so that its priority at the next execution is raised. The existing reward function considers only how the mutation policy function π() optimizes the input sample in the current mutation operation and ignores the number of new branches triggered, so the reward function needs to take into account the total number of new branches triggered during sample mutation.

(2) Reward based on the average new-branch trigger update rate of the mutation strategy

During mutation of the input sample, the optimizing capability of a mutation strategy can be evaluated not only by the total number of new branches in the branch detection results over its execution history, but also by its average new-branch trigger update rate, in particular the effect that the order in which the strategy's updates occur has on that rate. For example, suppose that over 30 mutations two mutation strategies each update the branch detection result in 10 of them, but one does so in the first 10 mutations and the other in the 10 most recent (adjacent) mutations. The strategy whose updates fall in the 10 most recent mutations has a higher average new-branch trigger update rate, is more likely to update the branch detection result at the next mutation of the input sample, and should be given higher priority.
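The 30-mutation example can be reproduced numerically with the Python sketch below. The source's formula for the average new-branch trigger update rate is not reproduced in this text, so the exponentially decayed weighting used here is purely an assumed stand-in that has the property the example describes: recent updates count for more than early ones. The function names and the `decay` parameter are hypothetical.

```python
from typing import Sequence


def total_new_branches(history: Sequence[bool]) -> int:
    """Total number of mutations in the history that triggered a new branch."""
    return sum(history)


def recency_weighted_rate(history: Sequence[bool], decay: float = 0.9) -> float:
    """Assumed stand-in for the average new-branch trigger update rate.

    history[k] is True if the k-th mutation (oldest first) triggered a new
    branch. The most recent mutation has weight 1, the one before it
    `decay`, then `decay ** 2`, and so on.
    """
    n = len(history)
    if n == 0:
        return 0.0
    weights = [decay ** (n - 1 - k) for k in range(n)]
    hit_weight = sum(w for w, hit in zip(weights, history) if hit)
    return hit_weight / sum(weights)


# Two strategies, each updating the branch detection result in 10 of 30 mutations.
early = [True] * 10 + [False] * 20    # updates in the first 10 mutations
recent = [False] * 20 + [True] * 10   # updates in the 10 most recent mutations

same_total = total_new_branches(early) == total_new_branches(recent)
recent_wins = recency_weighted_rate(recent) > recency_weighted_rate(early)
```

Both strategies have the same total, so only a rate that is sensitive to ordering can rank the second strategy higher.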

In the present specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the others; for identical or similar parts, the embodiments may be referred to one another. Because the device disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. The binary vulnerability discovery process dynamic optimization method based on deep reinforcement learning is characterized by comprising the following steps of:

2. The method for dynamically optimizing a binary vulnerability discovery process based on deep reinforcement learning of claim 1, wherein converting the current state into a mutated sample state by a mutation strategy specifically comprises:

$$a_t = \pi(s_t)$$

$$s_t' = \mathrm{Mutate}(s_t, a_t)$$

wherein Mutate() is the mutation function.

3. The method for dynamically optimizing a binary vulnerability discovery process based on deep reinforcement learning of claim 1, wherein selecting the input sample state of the next time step based on the binary program sample execution path information specifically comprises:

if, at time step t, the current sample state is $s_t$, the mutated sample state is $s_t'$, and a new sample execution path $p_t$ is generated during execution, then the sample execution path $p_t$ is added to the queue $Q_s$, the set $P$ is updated, and $s_t'$ is used as the input sample state of the next time step; if no new sample execution path $p_t$ is generated during execution, a sample is randomly selected from the effective sample queue $Q_s$ as the input sample state of the next time step, as shown in the following equation:

$$s_{t+1} = \begin{cases} s_t', & p_t \notin P \\ \mathrm{random\_choose}(Q_s), & \text{otherwise} \end{cases}$$

4. The dynamic optimization method for the binary vulnerability discovery process based on deep reinforcement learning of claim 1, wherein calculating the feedback reward for the mutated sample state based on the coverage metric specifically comprises:

$$M_t' = \{\, m_{i,j} \mid m_{i,j} > 0,\ m_{i,j} \in M_t \,\}$$

according to the record subset $M_t'$ and the record set $M_t$, calculating the proportion of the jump edges covered when the current sample is executed to all jump edges of the target program, which is used as the feedback reward of the current sample; the specific calculation formula is as follows:

$$R_t = \frac{\mathrm{size}(M_t')}{\mathrm{size}(M_t)}$$