CN115391785A

CN115391785A - Method, device and equipment for detecting risks of software bugs

Info

Publication number: CN115391785A
Application number: CN202210988093.8A
Authority: CN
Inventors: 吴荣鑫; 陈豪尔; 黄嘉峰; 王超; 范刚
Original assignee: Alipay Hangzhou Information Technology Co Ltd
Current assignee: Alipay Hangzhou Information Technology Co Ltd
Priority date: 2022-08-17
Filing date: 2022-08-17
Publication date: 2022-11-25

Abstract

The embodiment of the specification discloses a method, a device and equipment for detecting the risk of a software bug. The scheme comprises the following steps: acquiring a code of software to be detected, and extracting an API (application program interface) in the code; searching a set vulnerability database according to the API in the code to determine a target API which may have risks in the API in the code, and acquiring a vulnerability API corresponding to the target API in the vulnerability database and a repaired API obtained after the vulnerability API is repaired; extracting a differential AST feature between code of the vulnerability API and code of the repaired API, and a contextual AST feature corresponding to the differential AST feature; extracting respective AST features for the target API to match the differential AST features and the contextual AST features; and determining whether the target API has risks according to the matching result.

Description

Method, device and equipment for detecting risks of software bugs

Technical Field

The present disclosure relates to the field of computer security technologies, and in particular, to a method, an apparatus, and a device for detecting a risk of a software bug.

Background

A third-party software library is a code library that is developed by a third party and implements specific functions; a user can call the library directly to use those functions. Third-party software libraries spare developers from repeatedly developing code for the same function. When developing projects, for example Java projects, a large number of third-party software libraries often need to be called to complete certain functions in order to improve development efficiency. In general, however, a developer treats a third-party software library as a black box that implements a specific function and, lacking an understanding of its internal implementation details, may easily overlook its internal security problems. If the security of the third-party software library cannot be guaranteed, its vulnerabilities can be introduced into the host project under development. It is therefore important to ensure the security of the third-party software libraries that a project introduces, and some security detection needs to be performed on a third-party software library before it is introduced.

In order to detect dangerous calls, some current solutions use call graph analysis to detect whether a dangerous Application Programming Interface (API) of a third-party software library that contains a vulnerability is reachable. However, such a scheme can only detect whether the target API is reachable; it cannot determine whether the target API contains an unrepaired vulnerability. If the vulnerability API in the third-party software library has already been repaired, the call paths that reach the target API pose no security threat.

Based on this, there is a need for a more reliable risk detection scheme for software vulnerabilities to help determine whether software, such as third-party software libraries, can be safely invoked.

Disclosure of Invention

One or more embodiments of the present specification provide a method, an apparatus, and a device for detecting a risk of a software vulnerability, so as to solve the following technical problems: a more reliable risk detection scheme for software vulnerabilities is needed to help determine whether a third-party software library can be safely invoked.

To solve the above technical problem, one or more embodiments of the present specification are implemented as follows:

one or more embodiments of the present specification provide a method for detecting a risk of a software bug, including:

acquiring a code of software to be detected, and extracting an API (application program interface) in the code;

searching a set vulnerability database according to the API in the code to determine a target API which may have risks in the API in the code, and acquiring a vulnerability API corresponding to the target API in the vulnerability database and a repaired API obtained after the vulnerability API is repaired;

extracting a differential AST feature between code of the vulnerability API and code of the repaired API, and a contextual AST feature corresponding to the differential AST feature;

extracting respective AST features for the target API to match the differential AST features and the contextual AST features;

and determining whether the target API has risks according to the matching result.

One or more embodiments of the present specification provide a risk detection apparatus for a software bug, including:

a to-be-detected API extraction module, which acquires the code of the software to be detected and extracts the APIs in the code;

the to-be-matched API determining module is used for searching in a set vulnerability database according to the API in the code so as to determine a target API which may have risks in the API in the code, and acquiring a vulnerability API corresponding to the target API in the vulnerability database and a repaired API obtained after the vulnerability API is repaired;

an API feature extraction module to extract a differential AST feature between a code of the vulnerability API and a code of the repaired API, and a context AST feature corresponding to the differential AST feature;

an API feature matching module that extracts a corresponding AST feature for the target API to match the differential AST feature and the contextual AST feature;

and the API risk determining module is used for determining whether the target API has risks according to the matching result.

One or more embodiments of the present specification provide a risk detection device for a software bug, including:

at least one processor; and,

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to:

At least one technical scheme adopted by one or more embodiments of the specification can achieve the following beneficial effects. An Abstract Syntax Tree (AST) is used to describe the code at a finer granularity, so that the code structure and characteristics can be described in more detail. Compared with a coarser, statement-level approach, the AST can locate the modifications of a vulnerability fix more accurately: a small modification, such as a changed type or variable name or a replaced constant, appears as a node change, and the part actually involved in the fix can be obtained by extracting the subtrees of the changed part, which achieves more accurate positioning. Meanwhile, the target API is matched against both the vulnerability API and the corresponding repaired API; this comparison indirectly improves matching precision and prevents a target API that originally had the vulnerability but has since been repaired from being falsely flagged. Moreover, a fine-grained approach may increase the risk of mismatching (because relatively few code elements are matched, other similar segments may also match when the corresponding code segment is searched for in the target file); to filter out the mismatched segments, the corresponding context is also obtained and taken into consideration as a condition for the formation of the vulnerability. Thus, risk detection of software vulnerabilities can be performed more reliably, which also helps to judge more reliably whether software such as a third-party software library can be safely called.

Drawings

In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments described in the present specification, and for those skilled in the art, other drawings can be obtained according to the drawings without any creative effort.

FIG. 1 is a schematic diagram of a bug fix patch and a target file according to one or more embodiments of the present disclosure;

FIG. 2 is a schematic flowchart of a method for detecting risk of a software bug according to one or more embodiments of the present disclosure;

FIG. 3 is a schematic diagram of an AST structure of Java provided in one or more embodiments of the present specification;

FIG. 4 is a schematic diagram of a bug fix patch and an API including the bug according to one or more embodiments of the present disclosure;

FIG. 5 is a diagram illustrating the structure of various subtrees extracted for matching according to one or more embodiments of the present disclosure;

FIG. 6 is a schematic representation of a tree model description of a feature provided in one or more embodiments of the present description;

FIG. 7 is a schematic structural diagram of a risk detection apparatus for a software bug according to one or more embodiments of the present disclosure;

FIG. 8 is a schematic structural diagram of a risk detection device for a software vulnerability provided in one or more embodiments of the present specification.

Detailed Description

The embodiment of the specification provides a method, a device, equipment and a storage medium for detecting the risk of a software bug.

In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any inventive step based on the embodiments of the present disclosure, shall fall within the scope of protection of the present application.

The applicant made various attempts, gradually solved the new problems that arose on the basis of an initial idea, gradually improved the initial scheme, and finally obtained a preferred scheme with a good effect.

In the present application, determining whether the target project calls an API with a known vulnerability in a third-party software library requires some matching work to determine whether the third-party software library is a version in which the vulnerability has not yet been repaired. It is therefore necessary to summarize and match features of the specific code content of the third-party software library.

Initially, statement features obtained with a hash algorithm were considered. Hash values are identical only when the text is completely identical; otherwise completely different hash values are generated. In the API corresponding to a target file, however, a code segment may be partially changed, so that it is not completely identical to the code segments of the versions before and after the vulnerability fix. Examples include differences introduced by decompiling the third-party software library, modifications made to part of the code by the library's developers in an earlier or later library version, and adjustments made by the host developer to make the library code better fit the project scenario. In these cases the relevant code segment may differ from the vulnerability fix code segment, so a matching scheme based on the hash algorithm fails. Such a scheme is too coarse-grained, cannot accurately describe vulnerability-related modifications, and is too sensitive to code modifications.

More intuitively, refer to FIG. 1, a schematic diagram of a vulnerability fix patch and a target file provided by one or more embodiments of the present disclosure. In FIG. 1, (a) shows a fix patch in which line 156 replaces the called API isCorsAccessAllowed with isOriginAllowed and adds a parameter false; (b) shows the corresponding target API in the target file to be detected, which is an API in which the vulnerability has already been fixed and which does not include the conditional expression "pOrigin != null". The existing statement-level hash matching therefore cannot determine the matching relationship between line 156 in (a) and line 156 in (b), and cannot recognize that the target API has been repaired.
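The brittleness of statement-level hash matching can be illustrated with a minimal sketch. The two statements below are hypothetical and only modeled on the description of FIG. 1; the function name statement_hash and the whitespace normalization are assumptions for illustration, not part of the described scheme.

```python
import hashlib


def statement_hash(statement: str) -> str:
    """Hash one statement after collapsing runs of whitespace."""
    normalized = " ".join(statement.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical statements, modeled on the description of FIG. 1.
patch_line = "if (pOrigin != null && backendManager.isOriginAllowed(pOrigin, false)) {"
target_line = "if (backendManager.isOriginAllowed(pOrigin, false)) {"

# Whitespace differences are absorbed by the normalization...
reformatted_ok = statement_hash("a  =  b;") == statement_hash("a = b;")
# ...but any structural difference yields an unrelated hash, so the
# already-fixed target line is not recognized as matching the patch line.
fix_recognized = statement_hash(patch_line) == statement_hash(target_line)
```

Under these assumptions, reformatted_ok holds while fix_recognized does not, even though both statements contain the same repaired call.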

Based on this, the present application proposes using the AST to describe the code at a finer granularity, which describes the code structure and features in more detail. Compared with a statement-based approach, the AST can locate the modifications of a vulnerability fix more accurately; a slight modification, such as a changed type or variable name or a replaced constant, can be presented as a node change. The part actually involved in the fix can be obtained by extracting the subtrees of the part changed by the vulnerability fix, which achieves more accurate positioning. However, the fine-grained approach also increases the risk of mismatching: because few code elements are matched, other similar fragments may also match when the corresponding code fragment is searched for in the target file. To filter out the mismatched fragments, the scheme also obtains the corresponding context and takes it into consideration as a condition for the formation of the vulnerability.

For obtaining the context, taking the 3-5 lines of code statements near the fix code as the context was considered first. However, this cannot guarantee a semantic relation between those statements and the vulnerability fix segment, so the relevance of the context to the fix cannot be established, and the approach was improved further. The principle is as follows. User input, for example, often needs to be filtered before it is used, to prevent the malicious injection that direct use would allow, so a vulnerability is typically fixed by adding some verification of the user input between the point where the input is obtained and the point where it is used. In this case the definition-use relation of the variable forms a condition for the vulnerability to arise: the definition lies before the added verification segment and the use lies after it.

The scheme of the application may comprise two parts: a vulnerability database acquisition part and a detection and matching part. The detection and matching part can be divided into two stages: a suspicious target API location stage and a matching stage. The application implements a fine-grained vulnerability detection scheme based on the vulnerability fix patches of third-party software libraries; test data sets were constructed for testing, and good results were achieved. The above concepts are described in detail below.

FIG. 2 is a schematic flowchart of a method for detecting risk of a software bug according to one or more embodiments of the present disclosure. The method can be applied in different business fields, such as electronic payment, e-commerce, instant messaging, gaming and government services. The process can be executed on equipment in these fields that needs risk detection for software code vulnerabilities, such as a development and test machine; besides the development and test machine, it can also be deployed at various stages of software development, for example on an admission check machine or a version machine. Certain input parameters or intermediate results in the process allow manual intervention and adjustment to help improve accuracy.

FIG. 2 mainly shows the detection and matching part, which uses an existing vulnerability database or a self-constructed one to execute the process in FIG. 2. Taking a self-constructed database as an example, vulnerability information can be crawled from open-source databases and the patch segments corresponding to the vulnerability fixes obtained; the collected information is then analyzed to obtain the final vulnerability database, which serves as the reference for subsequent vulnerability location and matching.

The flow in fig. 2 may include the following steps:

S202: acquiring a code of the software to be detected, and extracting an API in the code.

The APIs in the code may also be referred to as methods; to avoid confusion, they are referred to as APIs in most embodiments. In the scenario of the background art, the software to be detected is a third-party software library, and the following embodiments mainly take a third-party software library as an example. The software to be detected can be any software whose vulnerability risk needs to be detected, in particular external software that is called although its security is currently unclear; the target file contains part or all of the code of the software to be detected. When the APIs to be detected are already known, they are extracted directly from the code of the software to be detected; otherwise, all or most of the APIs in the code can be extracted for general screening.

S204: according to the API in the code, searching is carried out in a set vulnerability database to determine a target API which may have risks in the API in the code (namely, a suspicious target API is positioned), and a vulnerability API corresponding to the target API in the vulnerability database and a repaired API obtained after the vulnerability API is repaired are obtained.

In one or more embodiments of the present specification, when a third-party software library is obtained, it is decompiled to obtain its code (if its source code is available, it is more reliable to use that without decompiling). All APIs in the project are then extracted, and the full name of each API (the whole path from the root directory of the package to the method) is used as a search condition to look up the vulnerability APIs recorded in the vulnerability database. The records that may match these APIs are extracted, yielding all relevant vulnerability APIs and the corresponding code of the APIs after the vulnerability is repaired. Of course, various search strategies are possible, and a fuzzy search strategy can be adopted according to actual needs. The repaired API is obtained, for example, from the vulnerability fix patch described above.
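The location stage can be sketched as an exact lookup keyed by the full API name. This is only a minimal illustration under stated assumptions: the record layout VulnRecord, the function locate_suspicious_apis, and the package names are all hypothetical, and a real deployment might use the fuzzy search mentioned above instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VulnRecord:
    """Hypothetical database record: one known vulnerability of one API."""
    full_name: str        # path from the package root down to the method
    vulnerable_code: str  # code of the vulnerability API
    repaired_code: str    # code of the API after the fix patch is applied


def locate_suspicious_apis(extracted_names, vuln_db):
    """Return {full API name: matching records} for APIs found in the database."""
    index = {}
    for record in vuln_db:
        index.setdefault(record.full_name, []).append(record)
    return {name: index[name] for name in extracted_names if name in index}


# Hypothetical data for illustration only.
db = [VulnRecord("org.example.lib.Handler.checkAccess", "old body", "new body")]
names = ["org.example.lib.Handler.checkAccess", "org.example.lib.Util.trim"]
suspicious = locate_suspicious_apis(names, db)
```

Each entry of suspicious pairs a target API with the vulnerable and repaired code that the later matching stage compares it against.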

S206: extracting a difference Abstract Syntax Tree (AST) feature between the code of the vulnerability API and the code of the repaired API, and a context AST feature corresponding to the difference AST feature.

At this stage, the code of the target API that may be at risk, the code of the vulnerability API and the code of the repaired API, all acquired in the location stage, are taken as input. Features are extracted from the three code segments, and the respective features, namely the features of the target API, the features of the vulnerability API and the features of the repaired API, are output. When the vulnerability features are matched, it is necessary to determine whether the features of the target API are similar to those of the vulnerability API or to those of the repaired API, so as to determine whether the target API is a vulnerability API that has not been repaired.

The AST describes the syntactic information of the code through a tree structure and is an abstract representation of the code syntax, in which each node represents a code structure and its text. Each vulnerability fix is necessarily accompanied by a change of the code, and the corresponding AST changes when the code changes. Describing a vulnerability fix directly by the text changes of the code cannot determine the syntactic structure of the changes and does not provide a comprehensive interpretation of the code. Unlike this direct description, a change on the AST determines the syntax information of the changed part more finely, and can thus describe the vulnerability more finely.

More intuitively, see FIG. 3, a schematic diagram of an AST structure of Java according to one or more embodiments of the present specification. In FIG. 3, each non-leaf node represents a Java syntactic structure (for example, ifStatement indicates that the node is the start of an if conditional statement and the nodes below it are the contents of that statement, and methodInvocation indicates that its substructure is a function call), and each leaf node represents the text of a part of the code and its meaning. The AST in FIG. 3 (a) corresponds to the deleted line 156 in FIG. 1 (a). The root node indicates the start of the if conditional statement, and the INFE node below it indicates an infix expression. The left subtree of that node corresponds to the expression "pOrigin != null"; the right subtree represents the structure of another expression, the API call "backendManager.isCorsAccessAllowed(pOrigin)", and its nodes describe the structure of the call, including the called object backendManager, the API name isCorsAccessAllowed, and the parameter pOrigin. IEO stands for the connecting operator &&. Together, these nodes constitute the syntax structure of line 156.

Since the AST is tree-structured, the differential AST feature, the context AST feature, and the corresponding AST feature to be described later may also be expressed as nodes or subtrees (here, a subtree mainly refers to a case where two or more nodes are included), and a node may also be regarded as a simplest subtree. The following describes the manner of extracting these three types of features.

In one or more embodiments of the present specification, the AST of the code of the vulnerability API is compared with the AST of the code of the repaired API to obtain a deleted subtree node set (a set composed of deleted subtrees and/or nodes) and an added subtree node set (a set composed of added subtrees and/or nodes) that reflect the corresponding repair changes, and the differential AST feature between the code of the vulnerability API and the code of the repaired API is determined according to the deleted subtree node set and the added subtree node set.

Specifically, for ease of explanation, the vulnerability API is described abstractly here. Assume that the vulnerability API is denoted APIv and its AST is ASTv, and that the corresponding repaired API is APIp and its AST is ASTp. Comparing ASTv and ASTp yields lists of the specific deleted, added and replaced AST subtrees or single nodes. The lists of added, deleted and replaced subtrees are denoted Listadd, Listdel and Listrep respectively, the corresponding node lists are denoted ListaddNode, ListdelNode and ListrepNode, and their elements are defined as follows:

a deleted subtree is represented by the tuple <ST1, Nil>, which means that the subtree ST1 needs to be deleted when ASTv is changed into ASTp;

an added subtree is represented by the tuple <Nil, ST2>, which means that the subtree ST2 needs to be added when ASTv is changed into ASTp;

a replaced subtree is represented by the tuple <ST1, ST2>, which means that the subtree ST1 needs to be replaced by the subtree ST2 when ASTv is changed into ASTp;

a deleted node is represented by the tuple <SN1, Nil>, which means that the node SN1 needs to be deleted when ASTv is changed into ASTp;

an added node is represented by the tuple <Nil, SN2>, which means that the node SN2 needs to be added when ASTv is changed into ASTp;

a replaced node is represented by the tuple <SN1, SN2>, which means that the node SN1 needs to be replaced by the node SN2 when ASTv is changed into ASTp.

In order to extract the subtrees of the code difference part and describe the characteristics before and after the vulnerability fix, two sets DSetv and DSetp are defined. They represent the deleted subtree node set and the added subtree node set respectively, and correspond to the vulnerability characteristics and the repaired characteristics; their subtree or node elements are taken from the lists above. The deleted subtree list Listdel and the deleted node list ListdelNode contain deleted subtree tuples <ST1, Nil> and deleted node tuples <SN1, Nil>; the subtree ST1 and the node SN1 are added to DSetv. The added subtree list Listadd and the added node list ListaddNode contain added subtree tuples <Nil, ST2> and added node tuples <Nil, SN2>; the subtree ST2 and the node SN2 are added to DSetp. The replaced subtree list Listrep and the replaced node list ListrepNode contain replaced subtree tuples <ST1, ST2> and replaced node tuples <SN1, SN2>; the subtree ST1 and the node SN1 are added to DSetv, and the subtree ST2 and the node SN2 are added to DSetp.

The deleted subtree node set and the added subtree node set could be used directly as the differential AST features. In practical applications, however, the number of elements in these sets may be large, and if the subtrees in these sets are used as matching objects and located one by one in the AST corresponding to the target API, finding the corresponding subtrees incurs a high time cost. In fact, the two sets may contain many single, scattered node changes that represent modifications of the same part. For example, the node "SN: isCorsAccessAllowed" in FIG. 3 (a) is included in DSetv, and "SN: isOriginAllowed" and "BL: false" in FIG. 3 (b) are added to DSetp, but "SN: isOriginAllowed" and "BL: false" actually represent modifications of the same part. Matching the scattered nodes directly would reduce matching efficiency and easily produce mismatches, so such related parts are integrated and described together to improve efficiency and reliability.
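The construction of DSetv and DSetp from the change tuples can be sketched as follows. Subtrees and nodes are represented here simply as strings, and the function name build_diff_sets and the sample values are assumptions for illustration.

```python
NIL = None  # stands for the Nil placeholder in the change tuples


def build_diff_sets(deleted, added, replaced):
    """Split <old, new> change tuples into DSetv (old side) and DSetp (new side)."""
    dset_v, dset_p = [], []
    for old, new in [*deleted, *added, *replaced]:
        if old is not NIL:
            dset_v.append(old)  # removed by the fix: vulnerability characteristic
        if new is not NIL:
            dset_p.append(new)  # introduced by the fix: repaired characteristic
    return dset_v, dset_p


# Hypothetical change lists; subtree and node lists are handled identically.
list_del = [("ST_del", NIL)]
list_add = [(NIL, "BL: false")]
list_rep = [("SN: isCorsAccessAllowed", "SN: isOriginAllowed")]
dset_v, dset_p = build_diff_sets(list_del, list_add, list_rep)
```

In this sketch a replacement contributes to both sets, which mirrors the rule that ST1 or SN1 goes to DSetv and ST2 or SN2 goes to DSetp.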

Based on this, in one or more embodiments of the present specification, for a plurality of nodes in the deleted subtree node set and the added subtree node set, a common ancestor node of the plurality of nodes is traced, and it is determined whether the common ancestor node indicates that the plurality of nodes are in the same program statement, if so, the common ancestor node is used as a root node to generate an AST subtree including the plurality of nodes, which is used as a differential AST feature between a code of the vulnerability API and a code of the repaired API.

Specifically, for nodes in the same code statement, their common ancestor node is obtained as the root node of the final feature subtree; all node or subtree changes under this root node are then included in it and are not described separately. When tracing back to a common ancestor on the AST, the trace goes no further than a Statement node (a node indicating that its substructure lies within a single code statement). Therefore, if two nodes or subtrees are not in the same program statement, no common ancestor at or below a Statement node can be found for them, and two separate subtrees are formed. In FIG. 3, for example, the common ancestor node of the nodes "SN: isOriginAllowed" and "BL: false", namely the MethodInvocation node, is obtained, and the subtrees rooted at the corresponding node are added to DSetv and DSetp respectively as the features of the final difference part for subsequent feature matching.

In one or more embodiments of the present specification, the subtree features of the difference part alone are not enough to describe the vulnerability features. Especially when the vulnerability fix patch modifies only a small amount of code, few nodes need to be matched, so a large number of redundant matches is easily produced when the matching subtrees are searched for; the extraction of a context is therefore considered as a supplement to feature matching. Because it serves as a supplement to feature matching, the context must have a certain relevance to the vulnerability and be usable as a feature describing the vulnerability; as mentioned above, blindly taking several nearby lines of code as the context is unreasonable. The context extraction scheme provided by the present application, including how to define which code segments can serve as vulnerability-related context and how to obtain them, is described here.
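The tracing of a common ancestor, bounded by the enclosing statement, can be sketched as below. This is a simplified illustration under stated assumptions: the Node class, the rule that a label ending in "Statement" marks a statement node, and all function names are hypothetical, and the tree is only modeled on the description of FIG. 3.

```python
class Node:
    """Minimal AST node with a parent link (hypothetical structure)."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent


def is_statement(node):
    # Assumption: statement node labels end with "Statement".
    return node.label.endswith("Statement")


def path_to_statement(node):
    """Nodes from `node` up to its nearest enclosing statement, or None."""
    path, current = [], node
    while current is not None:
        path.append(current)
        if is_statement(current):
            return path
        current = current.parent
    return None


def merge_changed_nodes(changed):
    """Return one feature root per statement that contains changed nodes."""
    roots, groups = [], {}
    for node in changed:
        path = path_to_statement(node)
        if path is None:
            roots.append(node)  # no enclosing statement: keep the node itself
        else:
            groups.setdefault(id(path[-1]), []).append(path)
    for paths in groups.values():
        common = {id(n) for n in paths[0]}
        for path in paths[1:]:
            common &= {id(n) for n in path}
        # paths[0] runs bottom-up, so the first shared node is the lowest one.
        roots.append(next(n for n in paths[0] if id(n) in common))
    return roots


if_stmt = Node("IfStatement")
infix = Node("InfixExpression", if_stmt)
call = Node("MethodInvocation", infix)
name = Node("SN: isOriginAllowed", call)
flag = Node("BL: false", call)
other_stmt = Node("ExpressionStatement")
other_name = Node("SN: x", other_stmt)

merged = merge_changed_nodes([name, flag, other_name])
```

Here the two changed nodes inside the if statement collapse into the single MethodInvocation root, while the change in the other statement stays a separate feature.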

For ease of understanding, the data flow, control flow, and program slice are briefly explained herein. A program basic block refers to a continuous code instruction set which does not comprise jump instructions, the program basic blocks are transferred when the jump instructions are carried out, the basic blocks are connected through edges, and the edges represent the execution sequence among the basic blocks. If the condition statement is satisfied or not, the subsequent execution of one of the two basic blocks is performed, or the loop condition points to the code block inside the loop body or the basic block after the loop is skipped. The control flow of the program is a code execution sequence diagram composed of the code blocks and edges.

The data flow analysis is specific to program variables or data, and is performed on the basis of a control flow graph to obtain interesting data transfer information, such as data analysis when all data sources for obtaining a variable value or a program is executed to a certain position. Program slicing refers to stripping and splitting statements in a program, only interested program statements are reserved for subsequent analysis, and other statements which do not need to be analyzed are stripped. For example, when a program is sliced for a variable X at a program point p, only statements that affect the variable X at the point are kept as slices for subsequent analysis.

In order to associate a context with a bug, a program slice may be used to obtain, after a program point where a variable related to a bug fix part is defined, all subsequent program statements having a data stream association with the variable, and statements having a control relationship with the bug fix statement (the existence of the control relationship means that, in a control flow graph, if a basic block a is passed before a basic block B, the basic block a is said to have a control relationship with the basic block B), and all the program statements obtained by the slice are used as contexts referred to by bug features. However, this approach also has significant problems, which may result in a large number of candidate sentences, in which case the matching is not large if the hash-based detection is performed, and may significantly affect the efficiency if the AST-based detection is performed. Therefore, further reduction of the above-described context is considered.

According to research on vulnerabilities, the formation of a vulnerability is often related to an error on the path from the definition of a variable to its use; for example, user input is trusted and used directly in subsequent code. Accordingly, when a vulnerability is repaired, the repair code is often inserted between the definition and the use.

Based on this, in one or more embodiments of the present specification, the AST of the code of the repaired API may be compared with the AST of the code of the vulnerability API to determine the variables involved in the repair; the definition nodes of these variables and their subsequent usage nodes are then obtained in the AST of the code of the vulnerability API and/or the AST of the code of the repaired API, and the context AST feature of the differential AST feature is determined accordingly. For example, all variables involved in the fix are taken as the starting points of the program slice; the definition nodes preceding the starting point and the usage nodes following it are obtained as the vulnerability context. Taking (a) in fig. 4 as an example, if the fixed part (line 314) involves actionMethod and the function parameter mapping, then the subsequent uses of mapping can be obtained (line 316 is not included because it is not in the same code branch), and the preceding definition (line 302) and subsequent uses of actionMethod are also obtained.
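A minimal sketch of this reduced, definition-to-use context follows. It assumes each statement is summarized by its line number and the variables it defines and uses; the `Stmt` record, the function `fix_context`, and the sample line numbers are hypothetical, and the same-branch restriction described above is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Stmt:
    line: int
    defines: set = field(default_factory=set)
    uses: set = field(default_factory=set)


def fix_context(stmts, fix_line, fix_vars):
    """Collect context lines for the variables touched by a fix:
    the nearest definition before the fix and every use after it."""
    context = set()
    for var in fix_vars:
        earlier_defs = [s.line for s in stmts
                        if s.line < fix_line and var in s.defines]
        if earlier_defs:
            context.add(max(earlier_defs))      # closest preceding definition
        context.update(s.line for s in stmts
                       if s.line > fix_line and var in s.uses)
    return sorted(context)


# Hypothetical statements; line 20 plays the role of the fix statement.
stmts = [
    Stmt(10, {"name"}, set()),
    Stmt(12, {"method"}, {"name"}),
    Stmt(20, set(), {"method", "mapping"}),
    Stmt(22, set(), {"method"}),
    Stmt(25, set(), {"other"}),
]

context_lines = fix_context(stmts, 20, {"method", "mapping"})
```

Here `mapping` contributes nothing because, as a parameter, it has no earlier definition statement and no later use in this sample.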

In one or more embodiments of the present specification, a corresponding subtree is matched in the AST of the code of the target API according to the delete subtree node set and the add subtree node set, a context AST feature is extracted in the AST of the code of the target API for the successfully matched subtree, and at least one subtree and the correspondingly extracted context AST feature that are successfully matched are used as corresponding AST features extracted for the target API.

Specifically, for example, for the AST of the target API code, subtrees corresponding to DSetv and DSetp are searched separately and repeated subtrees are removed, and if there are multiple possible matches, all matching results may be retained, and each matching result is used as a candidate for subsequent feature matching.

More intuitively, refer to fig. 5, which is a schematic structural diagram of the various subtrees extracted for matching according to one or more embodiments of the present specification. In fig. 5, when subtrees corresponding to a difference subtree are searched for in the target API, both T3 and T4 in the target API may be recorded as candidate subtrees. Program slicing is then used to obtain the context feature corresponding to each matching result, and the obtained context feature is added to its matching result to form the final feature matching candidate set. Contexts are allowed to overlap in part; that is, a given code statement may serve as the context of multiple matching results.

For a given difference feature subtree, if multiple matching subtrees are found in the target API, redundant mismatches are likely to exist. All matched feature subtrees can be screened in the subsequent matching stage, and the match with the highest similarity is finally kept as the correct match. In fig. 5, T3 and T4 are two different matching results, and at least one of them is necessarily a mismatch; here T4 is the incorrect match. It is first assumed that T3 is the subtree corresponding to the difference part, the corresponding context feature is obtained, and the feature similarity S1 is calculated. Similarly, a feature starting from T4 is obtained and a second similarity S2 is calculated. S1 and S2 are then compared, and the larger value is kept as the final feature similarity result. Since S2 is lower than S1, the T4 match is filtered out as a mismatch.
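The candidate filtering just described amounts to keeping the candidate with the largest similarity. A minimal sketch, in which the function name and the numeric values for S1 and S2 are made up for illustration and are not taken from the specification:

```python
def keep_best_match(similarity_by_candidate):
    """Return the (candidate, similarity) pair with the highest similarity,
    or None when there are no candidates."""
    if not similarity_by_candidate:
        return None
    return max(similarity_by_candidate.items(), key=lambda item: item[1])


# Hypothetical similarities: S1 for the feature built from T3, S2 from T4.
scores = {"T3": 0.8, "T4": 0.3}
best = keep_best_match(scores)
```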

In one or more embodiments of the present specification, even after the context is obtained in the foregoing manner, a correct determination may still be impossible, because the ASTs may remain dissimilar due to other noise; for example, the repair fragment in the target file has been moved elsewhere and is reached through a call, or a vulnerability repair with the same semantics but different code has been applied. Of these, the movement of the repair fragment could be handled by inter-API analysis; however, in order to improve matching efficiency, the target file is preprocessed so that only the corresponding API is retained, and cross-API analysis therefore cannot be performed. For semantically identical repairs, successful matching is difficult to achieve, since semantic equivalence between different code fragments is hard to define.

In addition, another very common kind of noise in this scenario is the merging and splitting of definition and usage. That is, in the target API, the definition and the usage of some variables in the vulnerability fix fragment and its related context may be merged into one statement, or part of an original code statement may be extracted and assigned to a new variable. This may be caused by the decompilation of the third-party software library, by the library developer modifying the code, or by the host project developer modifying the code a second time. For example, in the target API in fig. 4 (b), line 208 corresponds to line 314 in fig. 4 (a); however, because the actionMethod variable is directly replaced in the target API by its original definition, i.e., name.substring (enumeration + 1), line 208 is not similar to line 314. As shown in fig. 5, T3 in (d) in fig. 5 corresponds to line 208, (a) and (b) in fig. 5 each correspond to line 314, and (c) in fig. 5 is the obtained context subtree, which corresponds to line 302 in fig. 4. The MIA and MI nodes are the connection points of the actionMethod variable in the target API, and correspond respectively to the MIA node in T1 and the MI node in T5. If these two specific nodes of T1 and T5 (the nodes at the two ends of the vertical dashed line in the figure) are not connected during matching, T3 is unlikely to match T1 or T5 successfully. A more accurate match with T3 can be made only after T1 and T5 are connected by the dashed line as shown.

In view of the problems presented in the above two paragraphs, the present application further provides a definition-and-usage fusion scheme, so that the definition and usage of variables are consistent between the target API and the APIs before and after the vulnerability fix. The fusion scheme comprises: adapting to the corresponding structure of the AST of the code of the vulnerability API and/or the AST of the code of the repaired API, and re-fusing the definition node and the usage node into the same subtree by constructing a virtual edge, so as to keep consistency with the corresponding structure. Taking fig. 4 as an example, the definition in line 302 and the use in line 314 in (a) in fig. 4 may be merged by connecting the corresponding nodes through a virtual edge. It should be noted that the difference feature and the context feature are described below uniformly by a single tree, and matching is performed by a tree similarity algorithm; in order to maintain the tree structure, a plurality of virtual edges may be added to implement the connection, so as to reasonably improve structural consistency.

S208: extracting respective AST features for the target API to match the differential AST features and the context AST features.

In one or more embodiments of the present specification, a tree model (feature tree) is proposed to describe the above features uniformly for use in matching. Specifically: the differential AST feature and the context AST feature corresponding to the vulnerability API and to the repaired API are respectively determined; subtrees representing the differential AST feature and subtrees representing the context AST feature are connected by constructing a virtual root node, so that a corresponding feature tree is generated for each of the vulnerability API and the repaired API; a corresponding feature tree is generated for the target API according to its corresponding AST feature; and the similarity of the generated feature trees is compared to determine whether the target API matches the vulnerability API or the repaired API.

More intuitively, this is explained in connection with fig. 6, which is a schematic diagram of a tree model description of the features provided in one or more embodiments of the present specification. After the difference AST feature and the context AST feature are obtained, in order to represent the subtrees uniformly by one tree, a virtual root node may be created and the subtrees connected to it to form a feature tree, as shown in fig. 6. A subtree may be connected directly to the virtual root node, or indirectly through other virtual edges that do not attach to the virtual root node; the latter is used to reasonably improve structural consistency according to the actual situation and thereby make matching more reliable. Fig. 6 (a) shows the tree structure formed for the vulnerability features (including the difference AST feature and the context AST feature of the vulnerability API), fig. 6 (b) shows the tree structure formed for the repaired features (including the difference AST feature and the context AST feature of the repaired API), and fig. 6 (c) shows the tree structure extracted for the target API. Black nodes and white nodes respectively represent the feature subtrees deleted and added by the vulnerability repair difference part, gray nodes represent subtrees of the context part, and dotted lines 1, 2, and 3 show the correspondence between subtrees. After connection to a virtual root node, three feature trees Treev, Treep, and Treet are formed, corresponding respectively to the vulnerability features, the repaired features, and the target API features. It is subsequently necessary to determine whether Treet is similar to Treev as a whole, and whether Treet is more similar to Treev or to Treep.
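As a rough illustration of the feature tree described above, the sketch below joins difference subtrees and context subtrees under a virtual root. The `Node` class, the label `VIRTUAL_ROOT`, and the sample subtree labels are assumptions made for this example only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def build_feature_tree(difference_subtrees, context_subtrees):
    """Connect all feature subtrees directly to one virtual root node."""
    return Node("VIRTUAL_ROOT",
                list(difference_subtrees) + list(context_subtrees))


def count_nodes(node):
    """Number of nodes in the tree rooted at `node`."""
    return 1 + sum(count_nodes(child) for child in node.children)


# Hypothetical subtrees standing in for the difference part and the context part.
difference_part = [Node("IfStatement", [Node("mapping")])]
context_part = [Node("Assignment", [Node("actionMethod")])]

tree_v = build_feature_tree(difference_part, context_part)
```

The same construction would be applied three times, once each for the vulnerability features, the repaired features, and the target API features.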

Further, the definition-and-usage fusion scheme mentioned above can be applied to the tree model. The AST description corresponding to the example of fig. 4 is shown in fig. 5, where the T3 part of fig. 5 (d) (corresponding to line 208 in fig. 4) can be regarded as a combination of T1 and T5. If T1 and T5 are merely placed side by side under the virtual root node, their similarity to T3 is greatly reduced, because in T3 the MI node is attached at the position of the MIA node. A virtual edge therefore needs to be constructed from T1 to T5 to connect the definition and the use of the actionMethod variable. In the tree model of fig. 6, the dotted edge may be constructed for (a) in fig. 6; in order to preserve the tree structure, the edge originally created by adding the virtual root node then needs to be removed. This improves the similarity to the target API.
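The re-attachment described above can be sketched as follows, assuming the definition subtree (T5, rooted at an MI node) is moved beneath the MIA node of the usage subtree (T1) and its edge to the virtual root is dropped. The `Node` class, the helper names, and all labels other than MIA and MI are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def find(node, label):
    """Depth-first search for the first node carrying `label`."""
    if node.label == label:
        return node
    for child in node.children:
        hit = find(child, label)
        if hit is not None:
            return hit
    return None


def fuse(root, use_subtree, use_label, definition_subtree):
    """Replace the root -> definition edge by a virtual edge from the
    usage node (inside use_subtree) to the definition subtree."""
    use_node = find(use_subtree, use_label)
    attached = any(c is definition_subtree for c in root.children)
    if use_node is None or not attached:
        return False
    root.children = [c for c in root.children if c is not definition_subtree]
    use_node.children.append(definition_subtree)
    return True


t1 = Node("ExpressionStatement", [Node("MIA")])          # usage side
t5 = Node("MI", [Node("name"), Node("substring")])       # definition side
root = Node("VIRTUAL_ROOT", [t1, t5])

fused = fuse(root, t1, "MIA", t5)
```

After the call the result is still a tree: T5 has exactly one parent, the MIA node, which mirrors how the MI node sits under MIA in T3.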

In one or more embodiments of the present specification, after the corresponding AST feature, the difference AST feature, and the context AST feature are represented as the plurality of feature trees described above, the similarity between the feature trees is calculated using the tree edit distance to complete the matching. A variety of edit distance algorithms exist, but all of them consider only the cost of each edit operation and not the weight of the nodes involved. In a vulnerability repair scenario, however, different nodes often differ in importance. Taking the filtering of user input as an example, the repair of a vulnerability often performs a series of filtering operations around the variable that stores the user input, so the nodes corresponding to that variable are more important and should be given higher weights. This leads to a heuristic assumption: in a vulnerability fix, the more a user-defined lexical unit (a lexical unit that is not built into the Java language, such as a user-defined variable, parameter, method, or class) is changed by the fix, the more important that lexical unit is. A large-scale statistical investigation of vulnerability repair code found that most samples satisfy this assumption: when a repair patch performs more modification operations on a lexical unit, that lexical unit is often the focus of the repair, which supports the assumption.

Based on the above heuristic, the present solution assigns weights to nodes. When two trees are similar, their corresponding subtrees should also be similar; after weights are assigned, the similarity of the tree rooted at a given node should combine that node's own weight with the similarity of its subtrees. According to conventional thinking, nodes closer to the root are generally more important and should account for a larger share of the total weight, while deeper nodes (especially leaf nodes) have less influence on the overall similarity. In the scenario of the present application, however, this is not the case. In the AST of the present solution, leaf nodes store the lexical unit information, while non-leaf nodes represent the grammatical relationships between lexical units. Because the set of non-leaf node types is limited, the probability of accidental matches among non-leaf nodes is higher, so their weights should not be set too high. Leaf nodes store user-defined lexical units and are more distinctive, so the matching of leaf nodes deserves more attention.

Therefore, weights can be set for designated leaf nodes according to the difference in their occurrence in the feature trees before and after the vulnerability API is repaired, while other nodes can be left unweighted or given a default weight. The edit distances among the feature trees are then calculated according to the weights to determine whether the target API matches the vulnerability API or the repaired API.

The weight may be positively correlated with the degree of difference between the occurrences: the larger the difference, the more important the node is considered and the higher the weight. The designated leaf node is, for example, a leaf node of a user-defined lexical unit, and the occurrence is measured, for example, by the number of occurrences or the frequency of occurrence. Taking the number of occurrences as an example, an exemplary node weight calculation formula is as follows:

Weight(Node) = |Numofv(Node) - Numofp(Node)|;

where Node represents a designated leaf node, and the weight of other nodes is set, for example, to 1 by default; Numofv(Node) represents the number of times the node occurs in the API before the vulnerability fix, and Numofp(Node) represents the number of times the node occurs in the API after the vulnerability fix.
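The weight formula can be applied literally as in the sketch below. It assumes the leaf tokens of the feature trees before and after the fix are available as flat lists and that the caller supplies the set of user-defined tokens; the function name `leaf_weights` and the sample tokens and counts are hypothetical.

```python
from collections import Counter


def leaf_weights(vulnerable_leaves, repaired_leaves, user_defined, default=1):
    """Weight(Node) = |Numofv(Node) - Numofp(Node)| for user-defined tokens,
    and `default` for every other token."""
    num_v = Counter(vulnerable_leaves)
    num_p = Counter(repaired_leaves)
    weights = {}
    for token in set(num_v) | set(num_p):
        if token in user_defined:
            # Taken literally, an unchanged count gives a weight of 0.
            weights[token] = abs(num_v[token] - num_p[token])
        else:
            weights[token] = default
    return weights


# Hypothetical leaf tokens before and after a fix.
before = ["mapping", "actionMethod", "name", "null"]
after = ["mapping", "mapping", "actionMethod", "actionMethod",
         "actionMethod", "name", "null"]

weights = leaf_weights(before, after, {"mapping", "actionMethod", "name"})
```

In this sample `actionMethod` receives the largest weight because the fix changes its occurrence count the most.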

Further, the Zhang-Shasha tree edit distance algorithm is used as the base algorithm and is improved on the basis of the weights (mainly in the calculation of edit costs and the determination of a threshold) to adapt it to the vulnerability matching scenario of the present solution.

First, the Zhang-Shasha algorithm involves three types of node edit operations: node addition, node deletion, and node replacement, which respectively indicate that a node must be added, deleted, or replaced in order to transform one tree into the other. The operation costs in the Zhang-Shasha algorithm are independent of node weight, so weights are incorporated by adjusting how the edit operation cost is obtained: a node with a higher weight incurs a higher cost for each addition, deletion, or replacement, so that the loss or redundancy of such a node has a larger influence. The cost of simply adding or deleting a node can be adjusted as follows:

Cost(add) / Cost(del) = Weight(Node) × Costop; the slash here indicates "or".

Costop is the edit cost of the current operation type (adding or deleting a node) and can be adjusted by the user as needed. Because addition and deletion are inverse operations, their costs are generally set to be equal, with a default of 1, and can be further adjusted according to the observed effect. However, unlike addition and deletion, which involve a single node, node replacement involves two weights, namely the weight of the node before replacement and the weight of the node after replacement, denoted Weight(Nodedel) and Weight(Nodeadd) respectively. Further processing is therefore required to determine the edit cost of a replacement operation.

In order to select an appropriate cost for the replacement operation, four schemes were proposed and verified by experiment, taking respectively the maximum, the minimum, the average, and the sum of the weights of the two nodes. That is, the cost Cost(replace) of replacing a node is set respectively as follows:

Cost(replace) = Max(Weight(Nodedel), Weight(Nodeadd)) × Costop;

Cost(replace) = Min(Weight(Nodedel), Weight(Nodeadd)) × Costop;

Cost(replace) = (Weight(Nodedel) + Weight(Nodeadd)) / 2 × Costop;

Cost(replace) = (Weight(Nodedel) + Weight(Nodeadd)) × Costop;

Experiments show that the fourth formula performs relatively well: the highest accuracy is obtained when the replacement weight is taken as the sum of the node weights before and after replacement, and this setting can be used as a reference. In this case, replacing a node is equivalent to deleting the old node and then adding the new one, so the edit operations of the Zhang-Shasha algorithm are reduced to node addition and node deletion only.
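The weighted costs above can be written down directly. The sketch below covers only the cost functions, not the Zhang-Shasha dynamic program itself; the function and strategy names are illustrative.

```python
def add_or_delete_cost(weight, cost_op=1.0):
    """Cost(add) or Cost(del) = Weight(Node) x Costop."""
    return weight * cost_op


# The four replacement schemes described in the text.
REPLACE_STRATEGIES = {
    "max": lambda w_del, w_add: max(w_del, w_add),
    "min": lambda w_del, w_add: min(w_del, w_add),
    "mean": lambda w_del, w_add: (w_del + w_add) / 2,
    "sum": lambda w_del, w_add: w_del + w_add,
}


def replace_cost(weight_del, weight_add, strategy="sum", cost_op=1.0):
    """Cost(replace) under one of the four schemes; "sum" is the scheme
    the text reports as performing best."""
    return REPLACE_STRATEGIES[strategy](weight_del, weight_add) * cost_op
```

With the "sum" scheme, `replace_cost(2, 3)` equals `add_or_delete_cost(2) + add_or_delete_cost(3)`, which is the sense in which a replacement reduces to a deletion followed by an addition.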

After the edit costs are defined, the edit costs in the Zhang-Shasha algorithm can be replaced by the defined costs, and a new edit distance value can then be obtained. However, edit distance does not correspond directly to similarity, because the size of the trees being matched must also be considered. The present application considers that a similarity judgment between two trees with more nodes should tolerate more node differences, so the edit distance threshold for judging similarity should be correspondingly larger. For a more intuitive understanding, assume that the two trees to be matched each contain only one node and that these nodes differ; the resulting edit distance is a small number, yet the two trees cannot be said to be similar, because all of the nodes participating in the match differ. Therefore, the obtained edit distance needs to be normalized according to the scale of the trees. In the present solution, the denominator used for normalization is the sum of all node weights of the two trees, and the normalized value is used as the measure of similarity, giving the following similarity calculation formula:
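The formula itself does not appear in this text. One form that is consistent with the stated denominator and with the later threshold test, in which a higher value means greater similarity, is sketched below as an assumption rather than as the original formula; EditDist denotes the weighted edit distance and is a name introduced here.

```latex
\mathrm{Sim}(T_1, T_2) = 1 - \frac{\mathrm{EditDist}(T_1, T_2)}{\mathrm{WeightOfNodes}(T_1) + \mathrm{WeightOfNodes}(T_2)}
```

Under this assumed form, with Costop = 1 and the sum-based replacement cost, the distance cannot exceed the total weight of the two trees, so the value lies between 0 and 1.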

wherein WeightOfNodes(T1) and WeightOfNodes(T2) respectively represent the total node weights of trees T1 and T2. By calculating Sim(Treev, Treet) and Sim(Treep, Treet) and comparing their relative sizes, it can be judged whether the target API is more similar to the API before or after the vulnerability fix.

However, comparing the relative sizes alone is not enough to determine the existence of the vulnerability. Even if the target API is completely unrelated to the vulnerability API, some similarity may still be produced, although the value is usually low. Therefore, when the similarity values between the target API and the APIs before and after the vulnerability fix are both too low, the target API should not be judged to be related to the vulnerability, and the comparison of relative sizes is no longer performed. A similarity threshold t is accordingly set to filter out APIs unrelated to the vulnerability: when Sim(Treev, Treet) < t and Sim(Treep, Treet) < t, the target API can be considered unrelated to the vulnerability API. In order to obtain a suitable threshold, experimental verification was also carried out, with the threshold taking each of the 9 values from 0.1 to 0.9 in steps of 0.1. The experimental results show that relatively good matching results are obtained when the similarity threshold is set to 0.5, which can be used as a reference.
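The normalization and threshold test can be combined into a small decision routine. This is a sketch under two assumptions that the text does not confirm: similarity is taken as one minus the edit distance divided by the summed node weights of the two trees, and a tie between the two similarities is resolved in favor of "repaired". The function names and return labels are hypothetical.

```python
def similarity(edit_distance, weight_t1, weight_t2):
    """Assumed form: 1 - distance / (total weight of both trees)."""
    total = weight_t1 + weight_t2
    if total == 0:
        return 1.0                      # two empty trees are identical
    return 1.0 - edit_distance / total


def classify(sim_vulnerable, sim_repaired, threshold=0.5):
    """Apply the threshold t first, then compare the two similarities."""
    if sim_vulnerable < threshold and sim_repaired < threshold:
        return "unrelated"
    if sim_vulnerable > sim_repaired:
        return "vulnerable"
    return "repaired"                   # ties fall here (an assumption)


# Hypothetical distances and weights for Treev/Treet and Treep/Treet.
sim_v = similarity(edit_distance=4, weight_t1=10, weight_t2=10)
sim_p = similarity(edit_distance=10, weight_t1=10, weight_t2=10)
verdict = classify(sim_v, sim_p)
```

The default threshold of 0.5 is the reference value reported in the text.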

S210: and determining whether the target API has risks according to the matching result.

In one or more embodiments of the present specification, the target API is at risk if it successfully matches the vulnerability API. In practical applications, however, whether that risk matters also depends on whether the target API is actually used: if the target API is itself risky but is never called, the project is currently relatively safe. Therefore, whether the target API is an unrepaired vulnerability API can be determined according to the matching result, and whether at least one call chain to the target API exists in the project that calls the software to be detected can be judged; if both judgments are affirmative, the target API is determined to pose a call risk.
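The call-chain check can be sketched as a breadth-first search over a call graph. The example assumes the call graph is already available as a mapping from caller to callees; how that graph is built is outside this sketch, and all method names in the sample are hypothetical.

```python
from collections import deque


def find_call_chain(call_graph, entry_points, target):
    """Return one call chain from any entry point to `target`, or None."""
    queue = deque([entry] for entry in entry_points)
    visited = set(entry_points)
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current == target:
            return path
        for callee in call_graph.get(current, ()):
            if callee not in visited:
                visited.add(callee)
                queue.append(path + [callee])
    return None


def has_call_risk(is_unrepaired_vulnerability, call_chain):
    """Both conditions must hold: unrepaired and actually reachable."""
    return is_unrepaired_vulnerability and call_chain is not None


# Hypothetical call graph of a host project and a third-party library.
graph = {
    "main": ["service.handle"],
    "service.handle": ["lib.parse"],
    "lib.unused": ["lib.vulnerable"],
}

reachable_chain = find_call_chain(graph, ["main"], "lib.parse")
unreachable_chain = find_call_chain(graph, ["main"], "lib.vulnerable")
```

In this sample `lib.vulnerable` is never reached from `main`, so even an unrepaired match would not be reported as a call risk.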

Based on the same idea, one or more embodiments of the present specification further provide apparatuses and devices corresponding to the above-described method, as shown in fig. 7 and fig. 8.

Fig. 7 is a schematic structural diagram of a risk detection apparatus for a software bug according to one or more embodiments of the present specification, where the apparatus includes:

the detectable API extracting module 702 obtains a code of the software to be detected, and extracts an API in the code;

the to-be-matched API determining module 704 searches a set vulnerability database according to the APIs in the code to determine a target API, among the APIs in the code, that may be at risk, and acquires the vulnerability API corresponding to the target API in the vulnerability database and the repaired API obtained after the vulnerability API is repaired;

an API feature extraction module 706 that extracts a differential AST feature between the code of the vulnerability API and the code of the repaired API, and a context AST feature corresponding to the differential AST feature;

an API feature matching module 708 that extracts a corresponding AST feature for the target API to match the differential AST feature and the contextual AST feature;

the API risk determining module 710 determines whether the target API has a risk according to the matching result.

Optionally, the API risk determining module 710 determines, according to the matching result, whether the target API is a vulnerability API that has not been repaired;

if so, and if it is judged that at least one call chain to the target API exists in the project that calls the software to be detected, determining that the target API poses a call risk to the project.

Optionally, the API feature extraction module 706 compares the AST of the code of the vulnerability API with the AST of the code of the repaired API to obtain a deleted subtree node set and an added subtree node set that reflect corresponding repair changes;

and determining the difference abstract syntax tree (AST) feature between the code of the vulnerability API and the code of the repaired API according to the deleted subtree node set and the added subtree node set.

Optionally, the API feature extraction module 706 is configured to, for multiple nodes in the deleted subtree node set and the added subtree node set, trace back common ancestor nodes of the multiple nodes;

judging whether the common ancestor node represents that the nodes are in the same program statement or not;

and if so, generating an AST subtree that includes the plurality of nodes with the common ancestor node as its root node, and taking the AST subtree as at least part of the difference AST feature between the code of the vulnerability API and the code of the repaired API.

Optionally, the API feature matching module 708 is configured to match a corresponding sub-tree in the AST of the code of the target API according to the deleted sub-tree node set and the added sub-tree node set;

extracting a context AST feature in AST of the code of the target API for the subtree successfully matched;

and taking the at least one subtree with successful matching and the corresponding extracted context AST feature as the corresponding AST feature extracted for the target API.

Optionally, the API feature extraction module 706 determines a variable involved in the repairing of the AST of the code of the repaired API compared to the AST of the code of the vulnerability API;

and acquiring a definition node of the variable and a use node of the variable later in the AST of the code of the vulnerability API and/or the AST of the code of the repaired API, and determining the context AST characteristic of the difference AST characteristic according to the definition node of the variable and the use node of the variable.

Optionally, the API feature extraction module 706 is adapted to re-fuse the definition node and the usage node in the same sub-tree by constructing a virtual edge according to a corresponding structure of the AST of the code of the vulnerability API and/or the AST of the code of the repaired API, so as to maintain consistency with the corresponding structure.

Optionally, the API feature matching module 708 determines a difference AST feature and a context AST feature corresponding to the vulnerability API and the repaired API, respectively;

connecting subtrees representing the different AST characteristics and subtrees representing the context AST characteristics by constructing a virtual root node, and respectively generating corresponding characteristic trees for the vulnerability API and the repaired API;

generating a corresponding feature tree for the target API according to the corresponding AST feature;

and according to the generated feature tree, comparing the similarity of the feature tree, and determining whether the target API is successfully matched with the vulnerability API or successfully matched with the repaired API.

Optionally, the API feature matching module 708 represents the corresponding AST feature, the differential AST feature, and the contextual AST feature as a plurality of feature trees;

setting weights for the designated leaf nodes according to the difference between the appearance conditions of the designated leaf nodes in the feature tree before and after the bug API is repaired;

and calculating the editing distance between the plurality of feature trees according to the weight so as to determine whether the target API is successfully matched with the vulnerability API or successfully matched with the repaired API.

Optionally, the designated leaf node is a leaf node of a user-defined lexical unit;

the occurrence is measured by the number of occurrences, and the degree of difference between the numbers of occurrences is positively correlated with the correspondingly set weight.

Optionally, the software to be detected belongs to a third-party software library.

Fig. 8 is a schematic structural diagram of a risk detection device of a software vulnerability, provided in one or more embodiments of the present specification, where the device includes:

at least one processor; and,

a memory communicatively coupled to the at least one processor; wherein,

The processor and the memory may communicate via a bus, and the device may further include an input/output interface for communicating with other devices.

Based on the same idea, one or more embodiments of the present specification further provide a non-volatile computer storage medium corresponding to the method in fig. 2, and storing computer-executable instructions configured to:

extracting a difference AST feature between code of the vulnerability API and code of the repaired API, and a context AST feature corresponding to the difference AST feature;

extracting respective AST features for the target API to match the differential AST features and the context AST features;

The above description is merely one or more embodiments of the present disclosure and is not intended to limit the present disclosure. Various modifications and alterations to one or more embodiments of the present description will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement or the like made within the spirit and principle of one or more embodiments of the present specification should be included in the scope of the claims of the present specification.

Claims

1. A risk detection method for a software vulnerability, comprising:

acquiring a code of software to be detected, and extracting an Application Programming Interface (API) in the code;

searching in a set vulnerability database according to the API in the code to determine a target API which may have risks in the API in the code, and acquiring a vulnerability API corresponding to the target API in the vulnerability database and a repaired API obtained after the vulnerability API is repaired;

extracting a difference Abstract Syntax Tree (AST) feature between a code of the vulnerability API and a code of the repaired API, and a context AST feature corresponding to the difference AST feature;

and determining whether the target API has risks or not according to the matching result.

2. The method according to claim 1, wherein the determining whether the target API is at risk according to the matching result specifically includes:

determining whether the target API is a vulnerability API which is not repaired according to the matching result;

3. The method of claim 1, wherein extracting difference Abstract Syntax Tree (AST) features between the code of the vulnerability API and the code of the fixed API comprises:

comparing the AST of the codes of the vulnerability APIs with the AST of the codes of the repaired APIs to obtain a deleted sub-tree node set and an added sub-tree node set which reflect corresponding repair changes;

4. The method of claim 3, wherein determining the differential AST feature between the code of the vulnerability API and the code of the repaired API based on the deleted subtree node set and the added subtree node set comprises:

for a plurality of nodes in the deleted subtree node set and the added subtree node set, tracing back to a common ancestor node of the plurality of nodes;

and if so, generating an AST subtree that contains the plurality of nodes with the common ancestor node as its root node, and taking the AST subtree as the differential AST feature between the code of the vulnerability API and the code of the repaired API.
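Tracing a common ancestor can be sketched as follows, again assuming Python's `ast` module as a stand-in. The helpers `parent_map`, `path_to_root`, and `common_ancestor` are hypothetical names. The condition under which the claim proceeds ("if so") is not stated in this text, so the sketch simply takes the lowest common ancestor and uses it as the subtree root.

```python
import ast

def parent_map(tree):
    """Map id(child) -> parent for every node in the tree."""
    return {id(child): node
            for node in ast.walk(tree)
            for child in ast.iter_child_nodes(node)}

def path_to_root(node, parents):
    """Return the chain of nodes from the tree root down to `node`."""
    path = [node]
    while id(path[-1]) in parents:
        path.append(parents[id(path[-1])])
    return path[::-1]

def common_ancestor(nodes, parents):
    """Lowest node lying on the root path of every node in `nodes`."""
    paths = [path_to_root(n, parents) for n in nodes]
    ancestor = None
    for level in zip(*paths):
        if any(n is not level[0] for n in level):
            break
        ancestor = level[0]
    return ancestor

SOURCE = (
    "def f(a, b):\n"
    "    if a:\n"
    "        x = 1\n"
    "        y = 2\n"
    "    return b\n"
)
tree = ast.parse(SOURCE)
# Pretend the two assignments are the changed nodes found by the diff.
changed = [n for n in ast.walk(tree) if isinstance(n, ast.Assign)]
root = common_ancestor(changed, parent_map(tree))
```

Here the subtree rooted at `root` is the enclosing `if` statement, which contains both changed nodes.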

5. The method of claim 3, wherein said extracting the corresponding AST features for the target API specifically comprises:

matching a corresponding sub-tree in the AST of the code of the target API according to the deleted sub-tree node set and the added sub-tree node set;

and taking the at least one subtree with successful matching and the corresponding extracted contextual AST features as corresponding AST features extracted for the target API.

6. The method of claim 1, wherein said extracting the contextual AST features corresponding to the differential AST features comprises:

determining the variables involved in the repair by comparing the AST of the code of the repaired API with the AST of the code of the vulnerability API;

and acquiring a definition node of each variable and subsequent use nodes of the variable in the AST of the code of the vulnerability API and/or the AST of the code of the repaired API, and determining the contextual AST feature of the differential AST feature based on the definition node and the use nodes.
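A minimal sketch of collecting definition and use nodes follows, assuming Python's `ast` module and treating parameters and assignment targets as definitions. The names `variables_in` and `def_use_context` are hypothetical, and the result records line numbers rather than whole nodes to keep the example short.

```python
import ast

def variables_in(node):
    """Names of all variables referenced inside a changed subtree."""
    return {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}

def def_use_context(func, variables):
    """Line numbers where each variable is defined and where it is used."""
    context = {v: {"defs": [], "uses": []} for v in variables}
    for node in ast.walk(func):
        if isinstance(node, ast.arg) and node.arg in context:
            # A function parameter counts as a definition.
            context[node.arg]["defs"].append(node.lineno)
        elif isinstance(node, ast.Name) and node.id in context:
            kind = "defs" if isinstance(node.ctx, ast.Store) else "uses"
            context[node.id][kind].append(node.lineno)
    for entry in context.values():
        entry["defs"].sort()
        entry["uses"].sort()
    return context

SOURCE = (
    "def copy(src, n):\n"
    "    size = len(src)\n"
    "    if n > size:\n"
    "        n = size\n"
    "    return src[:n]\n"
)
func = ast.parse(SOURCE).body[0]
fix = func.body[1]  # assume the `if` statement is the repair change
context = def_use_context(func, variables_in(fix))
```

The definitions and uses found outside the changed subtree (the parameter `n`, the assignment to `size`, and the final `return`) are the kind of surrounding nodes that could serve as context for the change.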

7. The method of claim 6, wherein determining the contextual AST feature of the differential AST feature accordingly comprises:

adapting to the corresponding structure of the AST of the code of the vulnerability API and/or the AST of the code of the repaired API, re-fusing the definition node and the use node into the same subtree by constructing a virtual edge, so as to remain consistent with the corresponding structure.

8. The method of claim 1, wherein the matching against the differential AST feature and the contextual AST feature comprises:

respectively determining the difference AST characteristics and the context AST characteristics corresponding to the vulnerability API and the repaired API;

9. The method of claim 1, wherein the matching against the differential AST feature and the contextual AST feature comprises:

representing the respective AST features, the differential AST features, the contextual AST features as a plurality of feature trees;

10. The method of claim 9, wherein the designated leaf node is a leaf node of a user-defined lexical unit;

11. The method according to any one of claims 1 to 10, wherein the software to be detected belongs to a third-party software library.

12. A risk detection apparatus for software vulnerabilities, comprising:

13. The apparatus of claim 12, wherein the API risk determination module is configured to determine, according to the matching result, whether the target API is an unrepaired vulnerability API;

and if so, upon determining that at least one call chain to the target API exists in the project that calls the software to be detected, to determine that the target API poses a call risk to the project.
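The call-chain check can be sketched as a breadth-first search over a call graph. This is an illustrative assumption only: the text does not say how call chains are computed, and the function `find_call_chain`, the graph layout, and all method names below are hypothetical.

```python
from collections import deque

def find_call_chain(call_graph, entry_points, target_api):
    """Return one call chain from an entry point to target_api, or None."""
    queue = deque([entry] for entry in entry_points)
    seen = set(entry_points)
    while queue:
        chain = queue.popleft()
        if chain[-1] == target_api:
            return chain
        for callee in call_graph.get(chain[-1], ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(chain + [callee])
    return None

# Hypothetical call graph of a project that uses a third-party library.
CALL_GRAPH = {
    "app.main": ["app.load"],
    "app.load": ["lib.parse"],
    "app.unused": ["lib.render"],
}
chain = find_call_chain(CALL_GRAPH, ["app.main"], "lib.parse")
no_chain = find_call_chain(CALL_GRAPH, ["app.main"], "lib.render")
```

Under this sketch, an unrepaired vulnerability API such as `lib.parse` would be flagged as a call risk because a chain reaches it, while `lib.render` would not, since no chain from the entry point leads to it.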

14. The apparatus of claim 12, wherein the API feature extraction module is configured to compare the AST of the code of the vulnerability API with the AST of the code of the repaired API to obtain a deleted subtree node set and an added subtree node set that reflect the corresponding repair changes;

15. The apparatus of claim 14, wherein the API feature extraction module is configured to, for a plurality of nodes in the deleted subtree node set and the added subtree node set, trace back to a common ancestor node of the plurality of nodes;

16. The apparatus of claim 14, wherein the API feature matching module is configured to match corresponding subtrees in the AST of the code of the target API according to the deleted subtree node set and the added subtree node set;

and to extract, for each successfully matched subtree, contextual AST features from the AST of the code of the target API;

17. The apparatus of claim 12, wherein the API feature extraction module is configured to determine the variables involved in the repair by comparing the AST of the code of the repaired API with the AST of the code of the vulnerability API;

18. The apparatus of claim 17, wherein the API feature extraction module is configured to, adapting to the corresponding structure of the AST of the code of the vulnerability API and/or the AST of the code of the repaired API, re-fuse the definition node and the use node into the same subtree by constructing a virtual edge, so as to remain consistent with the corresponding structure.

19. The apparatus of claim 12, wherein the API feature matching module is configured to determine the differential AST features and the contextual AST features corresponding to the vulnerability API and to the repaired API, respectively;

20. The apparatus of claim 12, wherein the API feature matching module is configured to represent the corresponding AST features, the differential AST features, and the contextual AST features as a plurality of feature trees;

21. The apparatus of claim 20, wherein said designated leaf node is a leaf node of a user-defined lexical unit;

22. The apparatus according to any one of claims 12 to 21, wherein the software to be detected belongs to a third-party software library.

23. A risk detection device for a software vulnerability, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to cause the at least one processor to: