CN116893970A

CN116893970A - API misuse detection method and device based on frequent subgraph mining

Info

Publication number: CN116893970A
Application number: CN202310890481.7A
Authority: CN
Inventors: 蒋家盛; 吴敬征; 凌祥; 罗天悦; 武延军
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2023-07-19
Filing date: 2023-07-19
Publication date: 2023-10-17

Abstract

The invention discloses an API misuse detection method and device based on frequent subgraph mining. The method comprises the following steps: constructing a data set containing a plurality of API source codes, and generating at least one API path from each API source code; obtaining an API path pattern set corresponding to the data set based on the occurrence frequency of each subgraph in all API paths; and performing subgraph matching between the API path of the source code to be detected and the API path pattern set to obtain an API misuse detection result for that source code. The invention can reduce labor cost, expand the detection range, and enhance software security.

Description

API misuse detection method and device based on frequent subgraph mining

Technical Field

The invention relates to the field of API misuse error detection, in particular to an API misuse detection method and device based on frequent subgraph mining.

Background

With the rapid development of software supply chains, more and more software developers write code based on APIs packaged in code libraries. However, because API documentation is often missing, incomplete, or out of date, and developers generally lack knowledge of the APIs, developers violate the pattern of an API, i.e., the correct way of using it, resulting in API misuse errors. Statistics show that 92 Linux kernel vulnerabilities with repair suggestions were disclosed on the MITRE website in 2021, of which 27 (29.3%) were caused by API misuse, making it a main vulnerability type. Taking the common kmalloc-class APIs (including kmalloc, kvmalloc, kcalloc, etc.) as an example, their pattern is { check allocated memory size → kmalloc allocates memory → check returned pointer → release pointer }. If the "check allocated memory size" step is absent, remote code execution may result, as in CVE-2021-33909 and CVE-2021-43267. If the "release pointer" step is absent, a memory leak may result, as in CVE-2021-45480.

In the prior art, detecting API misuse errors requires manually defining API patterns and then detecting violations in the code by pattern matching. This not only raises the barrier to API misuse detection but also limits the detection range. Manually defined API patterns may therefore lead to a large number of false negatives, compromising software security.

Disclosure of Invention

Aiming at the defects of the prior art, the invention discloses an API misuse detection method and device based on frequent subgraph mining.

In order to achieve the above purpose, the invention adopts the following technical scheme:

an API misuse detection method based on frequent subgraph mining, the method comprising:

constructing a data set containing a plurality of API source codes, and generating at least one API path from each API source code;

obtaining an API path pattern set corresponding to the data set based on the occurrence frequency of each subgraph in all API paths;

and performing subgraph matching between the API path of the source code to be detected and the API path pattern set to obtain an API misuse detection result of the source code to be detected.

Further, the generating the API path from the source code includes:

converting the source code into an intermediate representation, wherein the intermediate representation comprises a control flow graph of a function, and the format of the intermediate representation is LLVM IR file;

traversing a control flow graph of each function in the intermediate representation by using a depth-first strategy to obtain an API calling operation, and calling an application interface in an LLVM C library to analyze the API calling operation so as to obtain a data association operation of the API;

and connecting the data association operations according to the execution order of the traversal to obtain the API path, which is stored in the form of a control flow graph.

Further, traversing the control flow graph of each function in the intermediate representation by using the depth-first policy to obtain an API call operation, and calling an application interface in the LLVM C library to parse the API call operation to obtain a data association operation of the API, including:

traversing each function, each basic block in the function and each operation statement in the basic block in the LLVM IR file through an iterator to obtain API calling operation;

calling a parseIRFile application interface in the LLVM C library to analyze the API calling operation to obtain a return value variable and a parameter variable of the API;

and carrying out data flow analysis on the return value variable and the parameter variable of the API to obtain the data association operation of the API.

Further, the data association operation includes:

direct data association operation, wherein shared variables exist in the direct data association operation or direct data association with assignment relation exists in the direct data association operation;

and

an indirect data association operation in which there is an indirect data association of pointer transfer between variables.

Further, the performing data flow analysis on the return value variable and the parameter variable of the API to obtain a direct data association operation includes:

acquiring parameters of the API by using a getOperand application interface;

recursively calling getOperand to obtain variables assigned to the API parameters;

recursively calling a users application interface to acquire the variables that use the return value of the API;

adding the variables assigned to the API parameters and the variables using the API return value to the associated variables;

and obtaining the operations that use the associated variables in the function control flow graph as the direct data association operations.

Further, the performing data flow analysis on the return value variable and the parameter variable of the API to obtain an indirect data association operation includes:

for each memory read/write operation among the direct data association operations, taking the corresponding memory write/read operation on the same address as an indirect data association operation.

Further, obtaining the API path pattern set corresponding to the data set based on the occurrence frequency of each subgraph in all API paths comprises:

splitting each API path into subgraphs comprising one edge, and adding each subgraph to a 1-edge-candidate set;

screening a 1-edge-frequent set from the 1-edge-candidate set according to the number of occurrences of the subgraphs;

generating a k-edge-candidate set from the (k-1)-edge-frequent set by combining, two by two, subgraphs having k-2 identical edges;

screening a k-edge-frequent set from the k-edge-candidate set according to the frequency of edge occurrence;

when the k-edge-frequent set is not empty, letting k=k+1 and returning to the step of generating a k-edge-candidate set from the (k-1)-edge-frequent set;

and, when the k-edge-frequent set is empty, taking the subgraphs in the (k-1)-edge-frequent set as the API path pattern set corresponding to the data set.

Further, the step of screening the 1-edge-frequent set from the 1-edge-candidate set according to the occurrence times of the subgraph comprises the following steps:

calculating the number of times each sub-graph appears in all API paths; under the condition that the same sub-graph appears for a plurality of times in any API path, judging that the number of times of the sub-graph appearing in the API path is 1;

calculating the ratio of the number of times of occurrence of the subgraph in all API paths to the total number of the API paths;

adding the sub-graph to the 1-edge-frequent set if the number of times is greater than a first set threshold and the ratio is greater than a second set threshold;

and discarding the subgraph when the number of times is smaller than a first set threshold or the ratio is smaller than a second set threshold.

An API misuse detection apparatus based on frequent subgraph mining, the apparatus comprising:

the API path extraction module is used for constructing a data set containing a plurality of API source codes and generating at least one API path from each API source code;

the path pattern generating module is used for obtaining an API path pattern set corresponding to the data set based on the occurrence frequency of each subgraph in all the API paths;

and the misuse detection module is used for performing subgraph matching between the API path of the source code to be detected and the API path pattern set to obtain an API misuse detection result of the source code to be detected.

A computer device, the computer device comprising: a processor and a memory storing computer program instructions; the processor, when executing the computer program instructions, implements the frequent sub-graph mining-based API misuse detection method of any one of the above.

Compared with the prior art, the invention has at least the following technical effects:

The source code is converted from text form into graph form, defined as an API path. Each node in the API path is an operation (including the API itself) that has a data association with the API, and each edge represents the execution order between nodes.

Frequent subgraphs are mined from the API path, defined as API path patterns.

By exploiting the downward closure property, a large number of infrequent candidate subgraphs are filtered out in advance, which addresses the low efficiency of frequent subgraph mining and enables frequent subgraphs to be mined quickly.

Subgraph matching is performed between the API path pattern and the API path, and an API path that fails to match is treated as a potential API misuse error.

The potential API misuse errors are ranked: the more often the corresponding API path pattern appears in the API paths, the more likely the potential API misuse error is considered to be a true error.

Drawings

FIG. 1 is a flow chart of an API path pattern automatic extraction and API misuse detection method based on frequent subgraph mining.

FIG. 2 is a flow chart of API path generation.

FIG. 3 is a flow chart of API path pattern extraction.

FIG. 4 is an exemplary diagram of an API path pattern and an API misuse; the upper left is the source code of the API path pattern, the upper right is the API path pattern converted from that source code, the lower left is source code that misuses the API, and the lower right is the API path converted from the misusing source code.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

The method for automatically extracting the API path mode and detecting the API misuse based on frequent subgraph mining in the embodiment mainly comprises the following steps:

step 1: an API path is generated from the source code.

An API path is a control flow graph made up of the operations that have a data association with the API. There are two kinds of data association: direct data association, in which operations share a variable or are linked by an assignment, and indirect data association, in which a pointer is passed between variables, i.e., through memory reads/writes. The source code is converted into the intermediate representation LLVM IR using the LLVM front-end compiler Clang; the function control flow graphs in the LLVM IR are traversed to obtain the API call operations; data flow analysis is performed for each API; and the resulting API path is generated and stored in graph form.

Specifically, the API path generation flowchart is shown in fig. 2.

1a) Convert the source code into the intermediate representation LLVM IR, which contains the control flow graph of each function. The intermediate representation is stored in a file of type "bc" and is a structured data representation. Go to 1b).

1b) Call the application interfaces in the LLVM C library and traverse each function control flow graph in the LLVM IR using the depth-first strategy. Go to 1c).

In one example, the parseIRFile application interface in the LLVM C library is used to parse the file, and then each function in the LLVM IR file, each basic block in the function, and each operation statement in the basic block are traversed through iterators to obtain the API call operations. An API call operation is an operation statement of type "CallInst", and the name of the API can be obtained by calling the getCalledFunction and getName application interfaces.
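The nested traversal described above can be pictured with a minimal Python sketch. It does not call LLVM: the `Function`, `BasicBlock`, and `Instruction` classes and the `find_api_calls` helper are hypothetical stand-ins for the structures a parsed IR file would expose (in LLVM itself, parseIRFile, CallInst, getCalledFunction, and getName belong to the C++ API). Only the function → basic block → operation statement iteration is taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Instruction:
    opcode: str       # e.g. "call", "icmp", "ret"
    callee: str = ""  # name of the called function, only for "call"


@dataclass
class BasicBlock:
    name: str
    instructions: list = field(default_factory=list)


@dataclass
class Function:
    name: str
    blocks: list = field(default_factory=list)


def find_api_calls(module, api_names):
    """Return (function, block, callee) for every call to a tracked API."""
    hits = []
    for fn in module:                    # each function in the module
        for bb in fn.blocks:             # each basic block in the function
            for inst in bb.instructions: # each operation statement in the block
                if inst.opcode == "call" and inst.callee in api_names:
                    hits.append((fn.name, bb.name, inst.callee))
    return hits


# Toy module: one function with two basic blocks.
module = [
    Function("probe", [
        BasicBlock("entry", [Instruction("icmp"), Instruction("call", "kmalloc")]),
        BasicBlock("exit", [Instruction("call", "kfree"), Instruction("ret")]),
    ]),
]
calls = find_api_calls(module, {"kmalloc", "kfree"})
```

Filtering by `opcode == "call"` plays the role of the "CallInst" type check, and reading `callee` plays the role of looking up the called function's name.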

1c) Each time an API call operation is visited during the traversal, perform data flow analysis on the return value variable and the parameter variables of the API to obtain the data association operations of the API. For each such operation, if it is not a memory read/write operation, it is a direct data association operation; if it is a memory read/write operation, the corresponding memory write/read operation on the same address is an indirect data association operation. Go to 1d).

In one example, the data association operation includes: direct data association operations and indirect data association operations. There is a shared variable or a direct data association of an assigned relationship in a direct data association operation, and there is an indirect data association of pointer transfer between variables in an indirect data association operation.

Specifically, when data flow analysis is performed on the API, the parameters of the API are first obtained by calling the getOperand application interface on the API, and getOperand is then called recursively to obtain the variables assigned to those parameters; these variables form part of the associated variables. Next, the users application interface is called recursively to obtain the variables that use the API return value and the associated variables obtained in the previous step, and these are added to the associated variables. Finally, the function control flow graph is traversed to obtain the operations that use the associated variables; these are the direct data association operations.

For indirect data association, each memory read/write operation among the direct data association operations is paired with the corresponding memory write/read operation on the same address, which is taken as an indirect data association operation. The direct and indirect operations together form the data association operations.
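The two kinds of association can be sketched on a toy instruction list. This is a minimal Python illustration under stated assumptions, not the LLVM-based analysis: the `Op` record, its `address` field, and the helper names are hypothetical. The backward walk over operands stands in for the recursive getOperand calls, the forward propagation stands in for the recursive users calls, and loads/stores are matched by an explicit address label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str             # unique label of the operation
    opcode: str           # "call", "load", "store", "icmp", ...
    result: str = ""      # variable defined by the operation, if any
    operands: tuple = ()  # variables read by the operation
    address: str = ""     # memory location, only for load/store


def associated_variables(ops, call):
    defs = {op.result: op for op in ops if op.result}
    assoc, stack = set(), list(call.operands)
    while stack:  # backward: parameters and the variables assigned to them
        var = stack.pop()
        if var not in assoc:
            assoc.add(var)
            if var in defs:
                stack.extend(defs[var].operands)
    if call.result:
        assoc.add(call.result)
    changed = True
    while changed:  # forward: variables computed from associated variables
        changed = False
        for op in ops:
            if op.result and op.result not in assoc and assoc & set(op.operands):
                assoc.add(op.result)
                changed = True
    return assoc


def data_associated_ops(ops, call):
    assoc = associated_variables(ops, call)
    direct = [op for op in ops if op.result in assoc or assoc & set(op.operands)]
    opposite = {"load": "store", "store": "load"}
    # Indirect: the write/read counterpart of a direct read/write on the same address.
    indirect = [op for op in ops if op not in direct and any(
        d.opcode in opposite and op.opcode == opposite[d.opcode]
        and op.address == d.address for d in direct)]
    return direct, indirect


ops = [
    Op("check_size", "icmp", "c", ("size",)),
    Op("kmalloc", "call", "ptr", ("size",)),
    Op("save_ptr", "store", "", ("ptr",), address="dev->buf"),
    Op("reload_ptr", "load", "q", (), address="dev->buf"),
    Op("unrelated", "add", "t", ("a", "b")),
]
direct, indirect = data_associated_ops(ops, ops[1])
```

Here `save_ptr` is direct because it uses the API return value, while `reload_ptr` shares no variable with the API and is picked up only through the shared address.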

1d) Traverse the function control flow graph depth-first again and connect the data association operations according to the execution order of the traversal. The result is a control flow graph consisting of the API's data association operations, which is output as the API path and stored in graph form.
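One way to picture step 1d) is a depth-first walk that remembers the last data association operation seen and adds an edge whenever the next one is reached. The sketch below is a simplified illustration: the dictionary-based control flow graph, the block and operation names, and `build_api_path` are all hypothetical.

```python
def build_api_path(cfg, block_ops, entry, associated):
    """Connect associated operations in depth-first execution order."""
    edges = set()
    visited = set()

    def dfs(block, last):
        # Keyed by (block, last) so a block reached after different
        # operations still contributes its edges; this also stops loops.
        if (block, last) in visited:
            return
        visited.add((block, last))
        for op in block_ops[block]:
            if op in associated:
                if last is not None:
                    edges.add((last, op))
                last = op
        for successor in cfg.get(block, []):
            dfs(successor, last)

    dfs(entry, None)
    return edges


# Toy function: the allocation is checked, then either used and freed or logged.
cfg = {"entry": ["ok", "fail"], "ok": ["exit"], "fail": ["exit"], "exit": []}
block_ops = {
    "entry": ["check_size", "kmalloc", "check_ptr"],
    "ok": ["use", "kfree"],
    "fail": ["log"],
    "exit": ["ret"],
}
associated = {"check_size", "kmalloc", "check_ptr", "kfree"}
api_path = build_api_path(cfg, block_ops, "entry", associated)
```

Operations outside `associated` (such as `use` and `log`) are skipped, so the resulting graph contains only the API's data association operations.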

In the example of FIG. 4, the API path on the right is generated from the source code on the left. The API, shown with a grey background, is _usecs_to_jiffies; each node in the path is a data association operation of _usecs_to_jiffies, and each edge represents the execution order between nodes.

Step 2: the API path pattern is extracted.

In general, most code is correct, i.e., most uses of an API follow its API path pattern, and an API pattern usually consists of the API's data association operations, so it is reasonable to infer that frequently occurring subgraphs in the API paths can be regarded as API path patterns. Based on this inference, the invention uses frequent subgraph mining to extract API path patterns automatically from the API paths. However, frequent subgraph mining has a bottleneck: the number of subgraphs grows exponentially. For any graph with n edges, the total number of subgraphs is on the order of $2^n$; counting every non-empty subset of edges as a subgraph, the numbers of subgraphs with 1 to n edges add up as follows:

$$\sum_{k=1}^{n} \binom{n}{k} = 2^{n} - 1$$

To address this problem, the invention exploits the downward closure property of frequent subgraphs: when a graph is frequent, any subgraph of that graph must also be frequent. Its contrapositive therefore also holds: if a subgraph is infrequent, any graph that contains that subgraph must also be infrequent. Using the contrapositive, infrequent candidates can be removed in advance, which effectively improves the efficiency of frequent subgraph mining.
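The pruning argument can be checked on a small example. In this sketch a subgraph is modelled as a frozenset of labelled edges, which is an assumption that reduces subgraph containment to set inclusion; `support` and `can_prune` are hypothetical helper names.

```python
def support(subgraph, paths):
    """Number of API paths that contain the subgraph."""
    return sum(1 for path in paths if subgraph <= path)


def can_prune(candidate, infrequent):
    """A candidate containing any known infrequent subgraph cannot be frequent."""
    return any(bad <= candidate for bad in infrequent)


paths = [
    frozenset({("a", "b"), ("b", "c")}),
    frozenset({("a", "b")}),
    frozenset({("a", "b"), ("b", "c"), ("c", "d")}),
]
small = frozenset({("b", "c")})
large = frozenset({("b", "c"), ("c", "d")})

# Downward closure: a larger graph never occurs more often than its subgraph.
assert support(large, paths) <= support(small, paths)

# Once ("c", "d") is known to be infrequent, `large` is discarded without counting.
infrequent = {frozenset({("c", "d")})}
pruned = can_prune(large, infrequent)
```

The saving comes from skipping the support count for every candidate that `can_prune` rejects.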

In one embodiment, the invention provides an algorithm for automatically extracting API path patterns, as shown in FIG. 3. The input of the algorithm is the API paths and the output is the API path patterns. The specific steps are:

2a) Split each API path from a graph into a set of edges, and add each edge, as a 1-edge subgraph, to the "1-edge-candidate set". Go to 2b).

2b) Screen the "1-edge-frequent set" from the "1-edge-candidate set" according to the number of times each element appears in the API paths. Go to 2c).

In one embodiment, the present invention calculates the number of times each element in the "1-edge-candidate set" occurs in all API paths (only 1 if it occurs multiple times in the same API path), and adds it to the "1-edge-frequent set" if the number of times is 10 or more and the ratio to the total number of API paths is 0.9 or more.
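The screening rule of this embodiment can be sketched as follows. The thresholds 10 and 0.9 come from the text; the edge-tuple representation of a path and the name `frequent_one_edge_set` are illustrative assumptions.

```python
from collections import Counter


def frequent_one_edge_set(api_paths, min_count=10, min_ratio=0.9):
    """Keep edges that occur in enough API paths, counting each path once."""
    counts = Counter()
    for path in api_paths:
        counts.update(set(path))  # set(): repeats within one path count as 1
    total = len(api_paths)
    return {edge for edge, n in counts.items()
            if n >= min_count and n / total >= min_ratio}


# 20 toy paths: two edges are near-universal, one edge is rare.
common = [("check_size", "kmalloc"), ("kmalloc", "check_ptr")]
paths = [common + [("check_ptr", "log")] for _ in range(5)]
paths += [list(common) for _ in range(14)]
paths += [[("check_size", "kmalloc"), ("check_size", "kmalloc")]]  # repeated edge
frequent = frequent_one_edge_set(paths)
```

In this data `("kmalloc", "check_ptr")` occurs in 19 of 20 paths (ratio 0.95) and is kept, while `("check_ptr", "log")` occurs in only 5 paths and is dropped.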

2c) Generate the "k-edge-candidate set" from the "(k-1)-edge-frequent set". Go to 2d).

In one embodiment, the invention combines, two by two, the elements of the "(k-1)-edge-frequent set" that have (k-2) identical edges, removes duplicates, and uses the result as the "k-edge-candidate set".
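A minimal sketch of this join step, assuming each subgraph is stored as a frozenset of edges (the function name `generate_candidates` is hypothetical). Two (k-1)-edge subgraphs that share k-2 edges have a union of exactly k edges, and collecting the unions in a set removes duplicates. The sketch does not check that the merged edges form a connected graph.

```python
from itertools import combinations


def generate_candidates(frequent_prev):
    """Join (k-1)-edge subgraphs that share k-2 edges into k-edge candidates."""
    candidates = set()
    for a, b in combinations(frequent_prev, 2):
        if len(a & b) == len(a) - 1:  # k-2 identical edges
            candidates.add(a | b)     # union has exactly k edges
    return candidates


e1, e2, e3 = ("a", "b"), ("b", "c"), ("c", "d")
frequent_2 = {frozenset({e1, e2}), frozenset({e2, e3}), frozenset({e1, e3})}
candidates_3 = generate_candidates(frequent_2)
```

All three pairwise joins produce the same 3-edge subgraph, so only one candidate remains.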

2d) Screen the "k-edge-frequent set" from the "k-edge-candidate set" according to the number of occurrences of the edges. If the "k-edge-frequent set" is not empty, add 1 to k and go to 2c); otherwise, go to 2e).

2e) Take the last non-empty "(k-1)-edge-frequent set" as the frequent subgraphs of the API paths, i.e., the path patterns of the API.
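Steps 2a) to 2e) can be put together in a compact sketch. It makes the same simplifying assumption as before, namely that subgraphs are frozensets of labelled edges so that containment is set inclusion; a real implementation would need proper subgraph isomorphism. The name `mine_patterns` and the toy data are illustrative.

```python
from itertools import combinations


def mine_patterns(api_paths, min_count=10, min_ratio=0.9):
    paths = [frozenset(p) for p in api_paths]

    def is_frequent(sub):
        n = sum(1 for p in paths if sub <= p)  # each path counts at most once
        return n >= min_count and n / len(paths) >= min_ratio

    # 2a) and 2b): 1-edge candidates, then the 1-edge-frequent set.
    frequent = {c for c in {frozenset([e]) for p in paths for e in p}
                if is_frequent(c)}
    previous = set()
    while frequent:  # 2d): loop while the k-edge-frequent set is not empty
        previous = frequent
        # 2c): join subgraphs sharing k-2 edges; the set removes duplicates.
        candidates = {a | b for a, b in combinations(previous, 2)
                      if len(a & b) == len(a) - 1}
        frequent = {c for c in candidates if is_frequent(c)}
    return previous  # 2e): the last non-empty frequent set


e1, e2, e3, e4 = ("check_size", "kmalloc"), ("kmalloc", "check_ptr"), \
    ("check_ptr", "kfree"), ("check_ptr", "log")
paths = [[e1, e2, e3, e4] for _ in range(3)]
paths += [[e1, e2, e3] for _ in range(16)]
paths += [[e1, e2]]  # one path that never frees the pointer
patterns = mine_patterns(paths)
```

With these 20 paths the rare edge `e4` is discarded in the first round, and the loop ends with the single 3-edge pattern that 19 of the 20 paths contain.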

In the example of FIG. 4, the graph on the upper right is the _usecs_to_jiffies path pattern obtained by frequent subgraph mining. Specifically, it appears as a subgraph in 146 of the 147 paths of _usecs_to_jiffies, so frequent subgraph mining outputs the upper-right graph as the _usecs_to_jiffies path pattern.
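As a worked check, assuming the thresholds of the embodiment in step 2b) (at least 10 occurrences and a ratio of at least 0.9) also apply to this example, the pattern clears both:

$$146 \ge 10, \qquad \frac{146}{147} \approx 0.993 \ge 0.9$$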

Step 3: API misuse errors are detected.

The invention takes each API path pattern as a matching template and checks whether the pattern is contained in the API path as a subgraph; if it is not, the API path is reported as a potential API misuse error.

In one embodiment, the inputs are the API path and the API path pattern, and the output is the potential API misuse errors.

Specifically, the detection process of the API misuse error is as follows.

3a) Acquire the API path of the source code to be detected.

3b) Perform subgraph matching between the API path of the source code to be detected and the API path pattern; if the matching fails, treat the path as a potential API misuse error and go to 3c).

3c) Rank the potential API misuse errors by the number of occurrences of the violated API path pattern: the more often the violated pattern occurs, the higher the priority of the error. Output the errors as a report in priority order.
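Steps 3a) to 3c) can be sketched as a containment check followed by a sort. As in the earlier sketches, patterns and paths are modelled as sets of labelled edges, which is a simplification of subgraph matching. The file locations, the `mutex_lock` pattern, and its count of 40 are hypothetical values chosen for illustration.

```python
def detect_misuse(api_paths, patterns):
    """Report paths that do not contain their API's pattern, most frequent pattern first."""
    reports = []
    for location, api, edges in api_paths:
        pattern, occurrences = patterns[api]
        if not pattern <= edges:               # 3b) subgraph matching failed
            missing = sorted(pattern - edges)  # pattern edges absent from the path
            reports.append((occurrences, location, missing))
    # 3c) higher pattern occurrence count means higher priority.
    reports.sort(key=lambda r: (-r[0], r[1]))
    return reports


patterns = {
    "kmalloc": (frozenset({("check_size", "kmalloc"), ("kmalloc", "check_ptr")}), 146),
    "mutex_lock": (frozenset({("mutex_lock", "mutex_unlock")}), 40),
}
api_paths = [
    ("b.c:open", "mutex_lock", {("mutex_lock", "return")}),
    ("a.c:probe", "kmalloc", {("check_size", "kmalloc")}),
    ("c.c:init", "kmalloc", {("check_size", "kmalloc"), ("kmalloc", "check_ptr"),
                             ("check_ptr", "kfree")}),
]
reports = detect_misuse(api_paths, patterns)
```

The path in `c.c:init` contains the whole kmalloc pattern and is not reported; the other two are reported with the violation of the more frequent pattern listed first.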

In the example of FIG. 4, the lower-right _usecs_to_jiffies path is subgraph-matched against the upper-right _usecs_to_jiffies path pattern. The upper-right path pattern is not a subgraph of the lower-right path, so the matching fails and the path is reported as a potential API misuse error.

In summary, the technical solution of the invention provides an API misuse detection method based on frequent subgraph mining: without requiring prior knowledge of the software, it uses statistical regularities in the code to extract the common usage of an API, takes that common usage as the API's pattern, and detects API misuse errors by subgraph matching.

Claims

1. An API misuse detection method based on frequent subgraph mining, the method comprising:

2. The method of claim 1, wherein generating an API path from source code comprises:

3. The method of claim 2, wherein traversing the control flow graph of each function in the intermediate representation using a depth-first policy to obtain an API call operation and calling an application interface in an LLVM C library to parse the API call operation to obtain a data association operation of the API, comprising:

4. A method as claimed in claim 2 or 3, wherein the data association operation comprises:

and

5. The method of claim 4, wherein the performing data flow analysis on the return value variable and the parameter variable of the API to obtain the direct data association operation comprises:

acquiring parameters of the API by using a getOperand application interface;

6. The method of claim 5, wherein the performing data flow analysis on the return value variable and the parameter variable of the API to obtain the indirect data association operation comprises:

7. The method of claim 1, wherein obtaining the set of API path patterns corresponding to the data set based on the occurrence frequencies of the sub-graphs in all API paths, comprises:

8. The method of claim 7, wherein the screening of the 1-edge-frequent set from the 1-edge-candidate set according to the number of occurrences of the subgraphs comprises:

9. An API misuse detection apparatus based on frequent subgraph mining, the apparatus comprising:

10. A computer device, the computer device comprising: a processor and a memory storing computer program instructions; the processor, when executing the computer program instructions, implements the frequent sub-graph mining based API misuse detection method of any one of claims 1-8.