CN117992966A - Vulnerability detection method, model training method and corresponding devices

Info

Publication number: CN117992966A
Application number: CN202410123360.4A
Authority: CN
Inventors: 陈思依; 曾九天; 何君尧; 李奇
Original assignee: Alibaba China Co Ltd
Current assignee: Alibaba China Co Ltd
Priority date: 2024-01-29
Filing date: 2024-01-29
Publication date: 2024-05-07

Abstract

The embodiments of the application disclose a vulnerability detection method, apparatus, device, and computer-readable storage medium. The main technical scheme comprises: performing static analysis on a target program to obtain at least one code execution path, wherein the code execution path comprises call relations among functions in the code of the target program; performing code segmentation on the code execution path to obtain a code segment set corresponding to the code execution path; classifying each code segment in the code segment set by using an authentication classification model to obtain a classification result indicating whether each code segment is an authentication segment, wherein the authentication classification model is trained in advance based on a machine learning model; and determining, according to the classification result, whether the target program has an unauthorized vulnerability. By classifying the code segments with the authentication classification model and deciding on that basis whether the target program has an unauthorized vulnerability, the application realizes security detection of unauthorized vulnerabilities in the target program.

Description

Vulnerability detection method, model training method and corresponding devices

Technical Field

The present application relates to the field of data security technologies, and in particular, to a vulnerability detection method, a model training method, and a corresponding device.

Background

With the rapid development of internet technology and application markets, software applications bring great convenience, but their security vulnerability risks are increasingly prominent. Security vulnerabilities not only leak user privacy but can also disrupt the normal operation of software applications. They have become an important factor limiting the development of applications.

An unauthorized access vulnerability allows a function that requires permission to be accessed without authorization. It is usually caused by defects in application functionality, missing authentication, or improper security configuration. Unauthorized access vulnerabilities have become one of the most common vulnerability types in recent years, and they are particularly difficult to detect because they arise from missing or inadequate authorization.

Disclosure of Invention

In view of the above, the present application provides a vulnerability detection method, a model training method and a corresponding device, so as to detect unauthorized vulnerabilities.

The application provides the following scheme:

In a first aspect, a vulnerability detection method is provided, the method including:

performing static analysis on a target program to obtain at least one code execution path, wherein the code execution path comprises call relations among functions in the code of the target program;

performing code segmentation on the code execution path to obtain a code segment set corresponding to the code execution path;

classifying each code segment in the code segment set by using an authentication classification model to obtain a classification result indicating whether each code segment is an authentication segment, wherein the authentication classification model is trained in advance based on a machine learning model; and

determining whether the target program has an unauthorized vulnerability according to the classification result.

According to an implementation manner of the embodiment of the present application, performing static analysis on the target program to obtain at least one code execution path includes:

performing static analysis on the target program to obtain an entry function and a sensitive operation function of the target program; and

constructing at least one code execution path from the entry function of the target program to the sensitive operation function, wherein the sensitive operation function comprises a function corresponding to a preset operation type.

According to an implementation manner in the embodiment of the present application, performing code segmentation on the code execution path to obtain a code segment set corresponding to the code execution path includes:

performing code segmentation on the code execution path to obtain a plurality of code segments;

determining the confidence of each of the plurality of code segments by using a preset authentication function identification rule; and

selecting, according to the confidence, the code segments that meet a preset confidence requirement from the plurality of code segments to form the code segment set corresponding to the code execution path.

According to an implementation manner of the embodiment of the present application, performing code segmentation on the code execution path to obtain a plurality of code segments includes:

segmenting the code execution path at a granularity smaller than or equal to function granularity to obtain a plurality of candidate code segments; and

screening the candidate code segments based on preset authentication features to obtain the plurality of code segments.
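The screening and confidence-based selection described above can be sketched as follows. This is a minimal illustration only: the keyword list, the scoring rule, and the threshold are assumptions chosen for the example, since the source does not specify the preset authentication features, the identification rule, or the confidence requirement.

```python
# Hypothetical authentication features; the source does not list concrete ones.
AUTH_KEYWORDS = ("auth", "permission", "token", "login", "role", "session")


def screen_candidates(candidates):
    """Keep candidate segments that mention at least one assumed feature."""
    return [c for c in candidates if any(k in c.lower() for k in AUTH_KEYWORDS)]


def rule_confidence(segment):
    """Toy identification rule: fraction of assumed keywords found in the segment."""
    text = segment.lower()
    hits = sum(1 for k in AUTH_KEYWORDS if k in text)
    return hits / len(AUTH_KEYWORDS)


def select_segments(candidates, min_confidence=0.3):
    """Screen the candidates, score them, and keep those meeting the threshold."""
    scored = [(rule_confidence(c), c) for c in screen_candidates(candidates)]
    return [c for score, c in scored if score >= min_confidence]


# Candidate segments at function granularity (illustrative code strings).
candidates = [
    "def check_login(user): return session.has_valid_token(user)",
    "def render_page(data): return template.format(data)",
    "def get_role(user): return user.role",
]
selected = select_segments(candidates)
```

In this example the second candidate is dropped at the screening stage and the third passes screening but falls below the assumed threshold, so only the first candidate reaches the classifier.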

According to an implementation manner of the embodiment of the present application, the authentication classification model includes: a first feature extraction network, a second feature extraction network, and a classification network;

The first feature extraction network is used for extracting a feature representation of each element (Token) from the input code segment;

The second feature extraction network is used for performing convolution and pooling on the feature representations of the Tokens to obtain a feature representation of the input code segment;

The classification network is used for obtaining a classification result of whether the code segment is an authentication segment by utilizing the characteristic representation of the input code segment.
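The data flow through the three networks can be sketched in plain Python as below. This is a hedged illustration under stated assumptions, not the model of the application: `embed_token` is a deterministic toy stand-in for the first feature extraction network, the kernels and classifier weights are fixed example values rather than trained parameters, and all function names are hypothetical.

```python
import math

EMBED_DIM = 4  # assumed size of each Token's feature vector


def embed_token(token):
    """Stand-in for the first network: a deterministic toy vector per Token."""
    total = sum(ord(ch) for ch in token)
    return [((total * (i + 1)) % 7) / 7.0 for i in range(EMBED_DIM)]


def conv_and_pool(token_vectors, kernel, bias=0.0):
    """Stand-in for the second network: one 1-D convolution filter slid over
    consecutive Tokens, a ReLU, then max pooling over all positions."""
    width = len(kernel)
    activations = []
    for start in range(len(token_vectors) - width + 1):
        window = token_vectors[start:start + width]
        value = bias + sum(
            w * x
            for weights, vector in zip(kernel, window)
            for w, x in zip(weights, vector)
        )
        activations.append(max(0.0, value))
    return max(activations) if activations else 0.0


def classify(segment_features, weights, bias=0.0):
    """Stand-in for the classification network: logistic output in (0, 1)."""
    logit = bias + sum(w * f for w, f in zip(weights, segment_features))
    return 1.0 / (1.0 + math.exp(-logit))


def forward(tokens, kernels, class_weights):
    """Token features -> convolution and pooling -> authentication probability."""
    vectors = [embed_token(t) for t in tokens]
    features = [conv_and_pool(vectors, k) for k in kernels]
    return classify(features, class_weights)


# Two example filters of widths 2 and 3 (arbitrary untrained values).
kernels = [
    [[0.5] * EMBED_DIM, [-0.5] * EMBED_DIM],
    [[0.1] * EMBED_DIM, [0.2] * EMBED_DIM, [0.3] * EMBED_DIM],
]
probability = forward(["if", "user", ".", "is_admin", "(", ")"], kernels, [1.0, -1.0])
```

Each filter contributes one pooled value, so the feature representation of the code segment has one entry per filter regardless of how many Tokens the segment contains.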

According to an implementation manner of the embodiment of the present application, the method further includes at least one of the following:

If the target program is determined to have an unauthorized vulnerability, expanding the training data used to train the authentication classification model by using the code segment set, wherein the expanding comprises: selecting a code segment from the code segment set as a code segment sample and labeling it as a non-authentication segment; and/or selecting a code segment from the code segment set, inserting an authentication function into the selected code segment, taking the code segment with the inserted authentication function as a code segment sample, and labeling it as an authentication segment;

if the target program is determined to have no unauthorized vulnerability, expanding the training data used to train the authentication classification model by using the code segments identified as authentication segments, wherein the expanding comprises: taking a code segment identified as an authentication segment as a code segment sample and labeling it as an authentication segment; and/or deleting the authentication function from a code segment identified as an authentication segment, taking the result as a code segment sample, and labeling it as a non-authentication segment.

According to an implementation manner of the embodiment of the present application, determining whether the target program has an unauthorized vulnerability according to the classification result includes:

If an authentication segment exists in the code segment set corresponding to the code execution path, determining that the target program does not have an unauthorized vulnerability; or

If an authentication segment exists in the code segment set corresponding to the code execution path and the authentication segment precedes a sensitive operation function, determining that the target program does not have an unauthorized vulnerability, wherein the sensitive operation function comprises a function corresponding to a preset operation type; or

If no authentication segment exists in the code segment sets corresponding to all code execution paths of the target program, determining that the target program has an unauthorized vulnerability; or

If no authentication segment exists in the code segment sets corresponding to all code execution paths of the target program, and an unauthorized vulnerability is also detected in the target program by at least one other vulnerability detection method, determining that the target program has an unauthorized vulnerability.
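The alternative decision rules above can be sketched as follows. The `Segment` and `PathResult` types, the position indices, and the `other_detector_hit` flag are hypothetical names introduced for this illustration; how an implementation combines the alternatives is left open by the text.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    position: int   # index of the segment along the code execution path
    is_auth: bool   # classification result from the authentication model


@dataclass
class PathResult:
    segments: list
    sensitive_position: int  # index of the sensitive operation function


def path_is_protected(path, require_before_sensitive=True):
    """True if the path has an authentication segment, optionally requiring
    that the segment precede the sensitive operation function."""
    for seg in path.segments:
        if seg.is_auth and (
            not require_before_sensitive or seg.position < path.sensitive_position
        ):
            return True
    return False


def has_unauthorized_vulnerability(paths, other_detector_hit=None):
    """Report a vulnerability when no path contains any authentication segment.
    If other_detector_hit is given, another detection method must also agree."""
    no_auth_anywhere = not any(
        path_is_protected(p, require_before_sensitive=False) for p in paths
    )
    if other_detector_hit is None:
        return no_auth_anywhere
    return no_auth_anywhere and other_detector_hit


protected = PathResult([Segment(0, False), Segment(1, True)], sensitive_position=3)
late_auth = PathResult([Segment(4, True)], sensitive_position=3)
bare = PathResult([Segment(0, False)], sensitive_position=2)
```

The `late_auth` path shows why the second alternative is stricter than the first: it contains an authentication segment, but only after the sensitive operation, so it is protected under the first rule and unprotected under the second.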

In a second aspect, a vulnerability detection method is provided and applied to a cloud server, and the method includes:

Acquiring a file of a target program uploaded by a user terminal;

determining whether the target program has an unauthorized vulnerability according to the classification result; and

sending information on whether the target program has an unauthorized vulnerability to the user terminal.

In a third aspect, a model training method is provided, the method comprising:

acquiring training data comprising a plurality of training samples, the training samples comprising: the code segment sample and the corresponding label thereof, wherein the label indicates whether the corresponding code segment sample is an authentication segment;

Training an authentication classification model using the training data, wherein the authentication classification model comprises a first feature extraction network, a second feature extraction network, and a classification network;

The first feature extraction network is used for extracting feature representations of all Token from the input code segment samples;

The second feature extraction network is used for performing convolution and pooling on the feature representation of each Token to obtain the feature representation of the input code segment sample;

The classification network is used for obtaining a classification result of whether the code fragment sample is an authentication fragment or not by utilizing the characteristic representation of the input code fragment sample;

the training targets include: minimizing the difference between the classification result and the label corresponding to the code segment sample.
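The text states only that training minimizes the difference between the classification result and the label; it does not name a loss function. One common choice for a binary classifier, given here purely as an assumption, is the binary cross-entropy loss, where $N$ is the number of code segment samples, $y_i \in \{0, 1\}$ is the label of the $i$-th sample (1 for an authentication segment), $\hat{y}_i$ is the probability output by the classification network, and $\theta$ denotes the model parameters:

```latex
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log \hat{y}_i + (1 - y_i) \log\big(1 - \hat{y}_i\big) \Big]
```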

According to an implementation manner of the embodiment of the present application, the acquiring training data including a plurality of training samples includes at least one of the following:

acquiring an application program determined to have an unauthorized vulnerability; performing static analysis on the application program to obtain at least one code execution path, and then performing code segmentation on the code execution path to obtain a code segment set corresponding to the code execution path; and selecting a code segment from the code segment set as a code segment sample and labeling it as a non-authentication segment;

selecting a code segment from the code segment set, inserting an authentication function into the selected code segment, taking the code segment with the inserted authentication function as a code segment sample, and labeling it as an authentication segment;

acquiring a code segment determined to be an authentication segment as a code segment sample and labeling it as an authentication segment;

deleting the authentication function from a code segment determined to be an authentication segment, taking the result as a code segment sample, and labeling it as a non-authentication segment.
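The four ways of obtaining training samples can be sketched as below. The authentication call `require_login(request)`, the line-based representation of a code segment, and the label values 1 and 0 are assumptions made for the example; the text does not prescribe how the authentication function is inserted or removed.

```python
AUTH_CALL = "require_login(request)"  # hypothetical authentication function call


def insert_auth(segment_lines, auth_call=AUTH_CALL):
    """Insert the assumed authentication call as the first body statement."""
    header, body = segment_lines[0], segment_lines[1:]
    return [header, "    " + auth_call] + body


def delete_auth(segment_lines, auth_call=AUTH_CALL):
    """Remove every line that consists solely of the assumed authentication call."""
    return [line for line in segment_lines if line.strip() != auth_call]


def samples_from_vulnerable(segment_lines):
    """Segment from a program known to be vulnerable: one non-authentication
    sample (label 0) and one synthetic authentication sample (label 1)."""
    return [(segment_lines, 0), (insert_auth(segment_lines), 1)]


def samples_from_auth_segment(segment_lines):
    """Known authentication segment: one authentication sample (label 1) and
    one non-authentication sample with the call deleted (label 0)."""
    return [(segment_lines, 1), (delete_auth(segment_lines), 0)]


vulnerable = ["def delete_user(request):", "    db.delete(request.user_id)"]
samples = samples_from_vulnerable(vulnerable)
round_trip = samples_from_auth_segment(samples[1][0])
```

Pairing each real segment with its edited counterpart yields positive and negative samples that differ only in the authentication function, which is the contrast the classifier is meant to learn.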

In a fourth aspect, a vulnerability detection apparatus is provided, the apparatus including:

The static analysis unit is configured to perform static analysis on the target program to obtain at least one code execution path, wherein the code execution path comprises calling relations among functions in the code of the target program;

the code segmentation unit is configured to conduct code segmentation on the code execution path to obtain a code fragment set corresponding to the code execution path;

The authentication classification unit is configured to classify each code segment contained in the code segment set by using an authentication classification model so as to obtain a classification result of whether each code segment is an authentication segment, wherein the authentication classification model is obtained by training in advance based on a machine learning model;

and the vulnerability identification unit is configured to determine whether the target program has unauthorized vulnerabilities according to the classification result.

In a fifth aspect, there is provided a model training apparatus, the apparatus comprising:

A sample acquisition unit configured to acquire training data including a plurality of training samples including: the code segment sample and the corresponding label thereof, wherein the label indicates whether the corresponding code segment sample is an authentication segment;

A model training unit configured to train an authentication classification model using the training data, wherein the authentication classification model comprises a first feature extraction network, a second feature extraction network, and a classification network; the first feature extraction network is used for extracting feature representations of all Token from the input code segment samples; the second feature extraction network is used for performing convolution and pooling on the feature representation of each Token to obtain the feature representation of the input code segment sample; the classification network is used for obtaining a classification result of whether the code segment sample is an authentication segment by using the feature representation of the input code segment sample; the training targets include: minimizing the difference between the classification result and the label corresponding to the code segment sample.

According to a sixth aspect, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method according to any implementation of the first aspect described above.

According to a seventh aspect, there is provided an electronic device comprising:

One or more processors; and

A memory associated with the one or more processors, the memory for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method according to any implementation of the first aspect described above.

According to the specific embodiment provided by the application, the application discloses the following technical effects:

1) The application performs static analysis on the target program, performs code segmentation on the resulting code execution path, classifies the code segments in the code segment set by using the authentication classification model, and determines whether the target program has an unauthorized vulnerability based on the classification result, thereby realizing security detection of unauthorized vulnerabilities in the target program.

2) The application uses a machine learning model as the authentication classification model. After the code execution path obtained through static analysis is segmented, the authentication classification model determines whether each resulting code segment is an authentication segment, and a detection result for unauthorized vulnerabilities is obtained on that basis. Compared with manual code auditing, this greatly reduces labor cost and raises the degree of automation and intelligence of unauthorized access vulnerability detection.

3) The application carries out static analysis on the target program, constructs a code execution path from the entry function of the target program to the sensitive operation function, and the code execution path from the entry function to the sensitive operation function covers a risk path in the real execution process as much as possible, thereby improving the coverage rate and recall rate of vulnerability scanning.

4) The application segments the code execution path and, during segmentation, screens out the code segments that meet the preset confidence requirement. This reduces the number of code segments that the authentication classification model must classify and improves the accuracy of unauthorized vulnerability detection.

5) The application classifies the code fragments by using a pre-trained machine learning model as an authentication classification model, and the authentication classification model extracts, convolves and pools the characteristic representation of each Token in the code fragments to obtain the characteristic representation of the code fragments, so that the code fragments are understood from two layers of a program language and a natural language, and the accuracy and generalization of the identification of the authentication fragments are improved.

6) The application can further expand the training data adopted for training the authentication classification model by utilizing the detection result obtained by the unauthorized access vulnerability detection, improve the quantity and quality of the training data input, and further improve the classification effect of the authentication classification model, thereby continuously improving the accuracy of the unauthorized access vulnerability detection.

7) On the basis of the classification result of the authentication classification model, the application can further combine other vulnerability detection methods to comprehensively judge whether an unauthorized vulnerability exists, thereby improving the success rate of finding unauthorized vulnerabilities.

Of course, it is not necessary for any one product to practice the application to achieve all of the advantages set forth above at the same time.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a diagram of a system architecture to which embodiments of the present application are applicable;

FIG. 2 is a flowchart of a method for vulnerability detection according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of an authentication classification model according to an embodiment of the present application;

FIG. 4 is a flowchart of a method for vulnerability detection applied to a cloud server according to an embodiment of the present application;

FIG. 5 is a flow chart of a method for model training provided by an embodiment of the present application;

FIG. 6 is a schematic block diagram of a vulnerability detection apparatus according to an embodiment of the present application;

FIG. 7 is a schematic block diagram of the model training apparatus according to one embodiment;

FIG. 8 is a schematic block diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which are derived by a person skilled in the art based on the embodiments of the application, fall within the scope of protection of the application.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be understood that the term "and/or" as used herein merely describes an association between associated objects and indicates that three relationships may exist. For example, A and/or B may represent: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.

Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if determined" or "if detected (stated condition or event)" may be interpreted as "when determined", "in response to determining", "when detected (stated condition or event)", or "in response to detecting (stated condition or event)", depending on the context.

Traditional unauthorized access vulnerability detection often relies on manual code auditing. This approach incurs high labor costs and, given the massive volume of program code and the limited number of security experts, is difficult to apply widely.

In view of this, the present application provides a vulnerability detection method, apparatus, device and computer readable storage medium. To facilitate an understanding of the present application, a system architecture to which the present application is applicable will be described first with reference to fig. 1. FIG. 1 illustrates an exemplary system architecture to which embodiments of the application may be applied. As shown in fig. 1, the system architecture includes a user terminal, a model training device and a vulnerability detection device located at a server.

The vulnerability detection device is used to perform vulnerability detection on the target program in the manner provided by the embodiments of the application. In the embodiments of the application, the target program refers to an application program to be subjected to security detection. Application programs are developed by program developers and released to the public to meet public demand. In this embodiment, the target program may be provided by a developer before it goes online, or may be provided by a user or a third party. The target program may be an application program that can be installed and run on various operating systems, including but not limited to Android, iOS, and Windows. The application program may be an online client application based on HTML5 technology, or any of various programs running on computer equipment. Electronic devices on which the target program runs include, but are not limited to, mobile phones, personal computers (PCs), tablet devices, notebook computers, palmtop computers (personal digital assistants, PDAs), wearable devices (e.g., smart glasses and smart watches), vehicle terminals, and the like. The mainstream operating systems are Android and iOS; the examples listed in this embodiment do not limit the format of the target program in the present application.

During vulnerability detection, the vulnerability detection device relies on the classification result of the authentication classification model, which is used to classify code segments. The model training device may train the authentication classification model in advance using the method provided by the embodiments of the application, for use by the vulnerability detection device.

As one possible implementation, the vulnerability detection device may be deployed on the server side. The user uploads the installation file of the target program to the server side through the user terminal; the vulnerability detection device then performs security detection on the target program using the method provided by the embodiments of the application and returns the detection result to the user terminal, which displays it to the user. The vulnerability detection device and the user terminal may interact through a network, and the system shown in fig. 1 illustrates such an implementation.

The vulnerability detection device and the model training device may each be deployed on a separate server, on a server group formed by a plurality of servers, or on a cloud server. A cloud server, also called a cloud computing server or cloud host, is a host product in a cloud computing service system that addresses the drawbacks of traditional physical hosts and Virtual Private Server (VPS) services, namely high management difficulty and weak service scalability. The devices may also be deployed on a computer terminal.

The user terminal may include, but is not limited to, smart mobile terminals, smart home devices, wearable devices, personal computers (PCs), and the like. Smart mobile terminals may include, for example, mobile phones, tablet computers, notebook computers, personal digital assistants (PDAs), and connected cars. Smart home devices may include devices such as smart televisions and smart refrigerators that are capable of uploading installation files and displaying security detection results. Wearable devices may include smart watches, smart glasses, smart bracelets, virtual reality (VR) devices, augmented reality (AR) devices, mixed reality devices (i.e., devices that support both virtual reality and augmented reality), and so forth.

After the vulnerability detection device completes vulnerability detection, the detection result may be returned by means including, but not limited to, email, short message, link, a communication application, or a web page. The user who submits the target program and the user who obtains the detection result are often the same, but need not be. For example, the user who submits the target program may be a program developer, while the user who obtains the detection result may be a security expert of a relevant department, a user of the application, or the like.

It should be understood that the number of user terminals, model training means, vulnerability detection means, and authentication classification models in fig. 1 is merely illustrative. There may be any number of user terminals, model training means, vulnerability detection means, and authentication classification models, as required by the implementation.

Fig. 2 is a flowchart of a vulnerability detection method according to an embodiment of the present application, where the method may be executed by the vulnerability detection apparatus in the system shown in fig. 1. As shown in fig. 2, the method may include the steps of:

Step 201: Perform static analysis on the target program to obtain at least one code execution path, wherein the code execution path comprises call relations among functions in the code of the target program.

Step 203: Perform code segmentation on the code execution path to obtain a code segment set corresponding to the code execution path.

Step 205: Classify each code segment in the code segment set by using an authentication classification model to obtain a classification result indicating whether each code segment is an authentication segment, wherein the authentication classification model is trained in advance based on a machine learning model.

Step 207: Determine whether the target program has an unauthorized vulnerability according to the classification result.

As can be seen from the above flow, the application performs static analysis on the target program to obtain at least one code execution path, performs code segmentation on the code execution path to obtain a code segment set, classifies the code segments in the set by using an authentication classification model to obtain a classification result indicating whether each code segment is an authentication segment, and determines whether the target program has an unauthorized vulnerability based on the classification result, thereby realizing security detection of unauthorized vulnerabilities in the target program.

Each step in the above-described flow and effects that can be further produced are described in detail below with reference to examples. It should be noted that the limitations of "first", "second", and the like in this disclosure are not limitations in terms of size, order, and number, but are merely intended to be distinguished by names. For example, a "first feature extraction network" and a "second feature extraction network" are used to distinguish between the different feature extraction networks.

First, the above-mentioned step 201, i.e. "performing static analysis on the target program to obtain at least one code execution path, where the code execution path includes call relations between functions in the code of the target program" will be described in detail in connection with the embodiments.

Static analysis is mainly carried out in two ways: manual review and automatic detection. The embodiments of the application take automatic detection tools as an example. A number of static analysis tools exist at present, for example Coverity, Klocwork, ApkPecker, and Fortify. Because different tools have different strengths, weaknesses, and target objects, a tool can be selected according to the target program. Static analysis does not execute the program itself; it inspects and analyzes the source code, bytecode, or binary executable form of the program.

The target program usually takes the form of code, and code execution may involve multiple function calls, the use of a single function, no function at all, or other forms. The code execution paths, i.e., the logical orders in which the code executes, can be regarded as forming a tree-like branching structure in the code. That is, the target program contains a plurality of code execution paths, and the relationship between them resembles that between the branches of a tree. In the embodiments of the application, because function call relationships are involved in how an unauthorized vulnerability arises, a code execution path that may be related to an unauthorized vulnerability first needs to be found in the code of the target program. The data structure of a code execution path is typically a call graph, which contains the call relations between functions starting from the entry function; each node in the path represents a function, and each directed edge represents a call relation between functions.

As an implementation manner of the application, static analysis can be carried out on the target program to obtain an entry function and a sensitive operation function of the target program; at least one code execution path from an entry function of the target program to a sensitive operation function is constructed, wherein the sensitive operation function comprises a function corresponding to a preset operation type.

The entry function is the starting point of execution in the target program. For example, in a Java program the entry function (i.e., the position where execution starts) is the main method. The main method is a special method of a Java program: when a Java program is run, the Java Virtual Machine (JVM) automatically looks up and executes the main method. The program starts execution from the main method and then executes the code in the method in the defined logical order.

The sensitive operations involved in the embodiments of the present application are operations that may cause security hazards, such as changing passwords, modifying user settings, inserting backdoor files, and deleting user data. Such operations typically involve insertion, deletion, modification, and the like, so these preset operation types may be regarded as sensitive operations.

For a secure application, sensitive operations must be authenticated, while the authentication of other operations has little impact on the security of the application. Thus, in an embodiment of the application, the object of analysis is a code execution path that starts at an entry function and includes a sensitive operation function. Static analysis tracks all possible execution paths: the entry function and the sensitive operation function of the target program are first determined by static analysis, and a code execution path from the entry function to the sensitive operation function is then constructed in order to judge whether an unauthorized vulnerability exists. There may be multiple resulting code execution paths.
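
As an illustration of constructing paths from an entry function to a sensitive operation function, the following is a minimal sketch that assumes the call graph is already available as a dictionary mapping each caller to its callees. The function `find_paths` and all function names in the example are hypothetical. The sketch simplifies a code execution path to a linear call chain and uses a plain depth-first search, which is only one possible way to enumerate paths.

```python
def find_paths(call_graph, entry, sensitive_functions):
    """Return every acyclic call chain from `entry` to a sensitive function."""
    paths = []

    def walk(function, path):
        path = path + [function]
        if function in sensitive_functions:
            paths.append(path)
            return
        for callee in call_graph.get(function, []):
            if callee not in path:  # skip recursive calls to avoid cycles
                walk(callee, path)

    walk(entry, [])
    return paths


# Hypothetical call graph: caller -> list of callees.
call_graph = {
    "main": ["update_profile", "render_page"],
    "update_profile": ["check_session", "change_password"],
    "check_session": [],
    "render_page": [],
    "change_password": [],
}
paths = find_paths(call_graph, "main", {"change_password"})
```

Here `render_page` never reaches a sensitive operation, so no path through it is produced, which matches the idea that only paths containing a sensitive operation function need to be analyzed.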

Step 203, namely "code segmentation is performed on the code execution path to obtain a code segment set corresponding to the code execution path", is described in detail below in connection with the embodiments.

As described above, the code execution path contains the call relations between functions in the code of the target program. If the authentication link is not reasonably set in the code execution path, an unauthorized vulnerability exists, meaning that a sensitive operation can be reached from the entry function without any authentication processing. After the code execution path is obtained, it is therefore necessary to determine whether an authentication link is set in the code execution path, whether the authentication position is reasonable, whether the authentication manner is proper, and so on.

When judging whether the code execution path is provided with an authentication link, whether the code execution path comprises an authentication segment can be judged. As one of the realizable modes, after the code execution path is segmented to obtain a plurality of code segments, the code segments are judged one by one, that is, the code execution path is segmented, and all the obtained code segments form a code segment set so as to perform the processing of the subsequent steps.

However, most of the code segments included in one code execution path are not authentication segments, and in order to improve the processing efficiency of the subsequent steps, as another realizable manner, the code execution path may be subjected to code segmentation to obtain a plurality of code segments; determining the confidence coefficient of a plurality of code segments by utilizing a preset authentication function identification rule; according to the confidence, selecting code segments meeting the preset confidence requirement from a plurality of code segments to form a code segment set corresponding to the code execution path, wherein the code segments contained in the code segment set can be regarded as key code segments corresponding to the code execution path.

An authentication function is a function in the target program that verifies permissions, either directly or in cooperation with a user operation. In the embodiment of the application, different scores can be preset for the keywords of common authentication functions in code, such as Session-Cookie authentication functions, Token authentication functions, and IF (conditional) authentication functions. Authentication function code may contain various keywords, such as id, user, equals, and other words related to the identity of the user. Different keywords characterize an authentication function to different degrees, so the scores of different keywords can be configured as needed. In the embodiment of the present application, the authentication function recognition rule may be to match these keywords against the code segment and to obtain the confidence of the code segment by combining the matched keywords according to their occurrence counts and corresponding scores, for example by accumulation or weighted averaging. The higher the confidence, the greater the likelihood that the code segment is an authentication function.

After the confidence of each of the plurality of code segments is obtained, the code segments can be screened according to the preset confidence requirement, which can be configured as needed. For example, the code segments whose confidence is greater than or equal to a preset confidence threshold, or the top-ranked code segments when sorted by confidence, form the code segment set corresponding to the code execution path.
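
The keyword-based confidence rule and the two selection criteria can be sketched as follows. This is a minimal sketch under stated assumptions: the keyword scores in `KEYWORD_SCORES`, the function names, and the sample segments are all hypothetical, and accumulation is used as the combination rule (the text also allows a weighted average).

```python
import re

# Hypothetical keyword scores; the text only says that scores are configurable.
KEYWORD_SCORES = {"token": 3.0, "session": 3.0, "user": 1.0, "id": 1.0, "equals": 0.5}


def confidence(segment, keyword_scores=KEYWORD_SCORES):
    """Accumulate score * occurrence count over the keywords matched in a segment."""
    # Split identifiers on capital letters so that `getToken` yields `get` and `Token`.
    words = re.findall(r"[A-Za-z][a-z]*", segment)
    return sum(keyword_scores.get(word.lower(), 0.0) for word in words)


def select_segments(segments, threshold=None, top_k=None):
    """Keep segments meeting a confidence threshold and/or the top-k ranked segments."""
    scored = sorted(((confidence(s), s) for s in segments),
                    key=lambda pair: pair[0], reverse=True)
    if threshold is not None:
        scored = [pair for pair in scored if pair[0] >= threshold]
    if top_k is not None:
        scored = scored[:top_k]
    return [segment for _, segment in scored]


segments = [
    "if (!session.getToken().equals(request.getToken())) return;",
    'logger.info("rendering page");',
    "String id = user.getId();",
]
key_segments = select_segments(segments, threshold=3.0)
```

With these assumed scores the logging statement scores zero and is dropped, while the two segments that mention session, token, user, or id are kept as key code segments.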

As one of the realizable modes, when the code execution path is subjected to code segmentation, the code execution path can be subjected to segmentation smaller than or equal to the granularity of a function, so as to obtain a plurality of code segments.

In the embodiment of the application, in order to improve the recognition fineness of the authentication fragment, the code execution path can be segmented smaller than or equal to the granularity of the function. Namely, as another realizable mode, when the code execution path is subjected to code segmentation, the code execution path can be subjected to segmentation smaller than or equal to the granularity of a function, so that a plurality of candidate code segments are obtained; and screening the plurality of candidate code segments based on preset authentication characteristics to obtain a plurality of code segments.

Not all of the code segments obtained by segmentation, namely the candidate code segments, need to be detected. A code execution path includes a large number of segments that are irrelevant to the authentication link, such as code related to page interaction and rendering, so only the code segments having authentication features need to be screened out for detection. The authentication feature may be, for example, a keyword or phrase such as an ID, a user name, or the conditional judgment if. In addition, code segments irrelevant to authentication, such as logging, printing, error reporting, and line-number code, can first be filtered out of the plurality of candidate code segments before screening based on authentication features, which further reduces the number of code segments and the workload of unauthorized vulnerability detection.
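
The two-stage screening of candidate code segments might look like the sketch below. The regular expressions `IRRELEVANT` and `AUTH_FEATURES` are hypothetical stand-ins for the preset rules, which the text does not specify in detail.

```python
import re

# Hypothetical patterns for authentication-irrelevant code (logging, printing,
# error reporting), following the examples named in the text.
IRRELEVANT = re.compile(r"\b(log|logger|print|println|printStackTrace)\b", re.IGNORECASE)
# Hypothetical authentication features: the conditional `if`, an ID, a user name.
AUTH_FEATURES = re.compile(r"\b(if|id|user\w*)\b", re.IGNORECASE)


def screen_candidates(candidates):
    """First drop irrelevant segments, then keep only those with authentication features."""
    kept = [c for c in candidates if not IRRELEVANT.search(c)]
    return [c for c in kept if AUTH_FEATURES.search(c)]


candidates = [
    'logger.info("user " + id + " opened page");',
    "if (currentUser.getId() != order.getOwnerId()) { deny(); }",
    "canvas.render(widget);",
]
screened = screen_candidates(candidates)
```

The logging line is removed in the first stage even though it mentions `user` and `id`, and the rendering line is removed in the second stage because it carries no authentication feature.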

Step 205, namely "classifying each code segment included in the code segment set by using an authentication classification model to obtain a classification result of whether each code segment is an authentication segment, where the authentication classification model is trained in advance based on a machine learning model", is described in detail below in connection with the embodiments.

An authentication fragment is a code fragment that has an authentication function or contains an authentication function. Authentication determines whether the user has access rights to an interface, or whether the user has the right to query or operate on the corresponding resource or data. The embodiment reduces manual auditing by classifying the code segment set with the authentication classification model. The model can understand code segments from both the programming-language and the natural-language perspective, so the various authentication features relied on in manual code auditing can be fully exploited by the model. This increases the probability that the model discovers the various ways in which authentication is written and reduces the cost of manual auditing. Each fragment of the code fragment set is input into the authentication classification model to obtain a classification result of whether that code fragment is an authentication fragment.

As one possible implementation, the authentication classification model may have the structure shown in fig. 3, comprising a first feature extraction network, a second feature extraction network, and a classification network.

The first feature extraction network is used for extracting the feature representation of each Token (element) from the input code segment.

The second feature extraction network is used for performing convolution and pooling on the feature representation of each Token to obtain the feature representation of the input code segment.

The classification network is used for obtaining the classification result of whether the code segment is an authentication segment by using the characteristic representation of the input code segment.

The first feature extraction network may be implemented using a pre-trained language model, such as BERT (Bidirectional Encoder Representations from Transformers) or XLNet (an autoregressive model that captures bidirectional context information through a permutation language model). BERT is a bidirectional pre-trained language model that uses the Transformer encoder as its model structure, so it can make good use of context information for feature learning. XLNet is a BERT-like model, a more generalized autoregressive pre-training model.

In an embodiment of the application, the code segments are in fact text. The Tokens of a text are the elements that constitute it. A text is segmented into a sequence of characters or words, and the characters or words in that sequence, together with the start symbol and the separators, are the Tokens.
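
To make the notion of a Token sequence concrete, the following is a minimal sketch of splitting a code segment into tokens framed by a start symbol and a separator. The function `tokenize` and its regular expression are hypothetical; `[CLS]` and `[SEP]` follow the BERT convention for the start symbol and the separator, and a real pre-trained model would use its own subword tokenizer rather than this simple rule.

```python
import re


def tokenize(code_segment):
    """Split code text into identifier, number, and symbol tokens with framing markers."""
    pieces = re.findall(r"[A-Za-z_]\w*|\d+|\S", code_segment)
    return ["[CLS]"] + pieces + ["[SEP]"]


tokens = tokenize("if (user.id == owner_id)")
```

Each element of `tokens` would then be mapped to a feature representation by the first feature extraction network.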

The feature representation of each Token may be input to the second feature extraction network, which performs convolution and pooling on these representations to extract high-dimensional, code-level feature vectors. The second feature extraction network may include a CNN (Convolutional Neural Network) layer and a pooling layer.

In the embodiment of the application, the classification network can be a binary classification network, implemented for example with a fully connected layer and a softmax layer. After the feature representation of the code segment is input, the classification result of the code segment is obtained. The classification result indicates whether the input code fragment is an authentication fragment. An authentication fragment is a code fragment having an authentication function, which may be understood as a fragment that includes an authentication function, is part of an authentication function, and the like.
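
The data flow through the second feature extraction network and the classification network can be sketched in plain Python. This is only an illustration of convolution, max pooling, a fully connected layer, and softmax on toy numbers: the token features stand in for the output of a pre-trained language model, the weights are untrained values chosen by hand, and all function names are hypothetical. A practical implementation would use a deep learning framework.

```python
import math


def conv1d(token_features, kernel, bias=0.0):
    """Slide a window over consecutive token vectors; one ReLU output per window."""
    width = len(kernel)
    outputs = []
    for start in range(len(token_features) - width + 1):
        total = bias
        for offset in range(width):
            vector, weights = token_features[start + offset], kernel[offset]
            total += sum(v * w for v, w in zip(vector, weights))
        outputs.append(max(0.0, total))
    return outputs


def max_pool(values):
    """Global max pooling: keep the strongest response of one kernel."""
    return max(values)


def softmax(logits):
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(token_features, kernels, fc_weights, fc_bias):
    """Return [P(non-authentication), P(authentication)] for one code segment."""
    segment_feature = [max_pool(conv1d(token_features, k)) for k in kernels]
    logits = [sum(f * w for f, w in zip(segment_feature, row)) + b
              for row, b in zip(fc_weights, fc_bias)]
    return softmax(logits)


# Toy inputs: four tokens with 2-dimensional features, two kernels of width 2.
token_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernels = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
fc_weights = [[1.0, 0.0], [0.0, 1.0]]
fc_bias = [0.0, 0.0]
probabilities = classify(token_features, kernels, fc_weights, fc_bias)
```

Pooling collapses a variable number of tokens into one fixed-length vector per segment, which is what allows a fully connected layer to classify code segments of different lengths.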

Step 207, namely "determining whether the target program has an unauthorized vulnerability according to the classification result", is described in detail below in connection with the embodiments.

After the authentication classification model outputs the classification result of whether each code segment is an authentication segment, it can be preliminarily determined whether the target program is provided with an authentication link, i.e., whether an unauthorized vulnerability affecting sensitive operations is present.

As one of the realizable modes, if an authentication fragment exists in the code fragment set corresponding to the code execution path, it is determined that the target program does not have an unauthorized vulnerability. It should be noted that, if the target program has multiple code execution paths, the code fragment set corresponding to every code execution path must contain an authentication fragment before it can be determined that the target program does not have an unauthorized vulnerability.

As another more preferable implementation manner, if the authentication fragment exists in the code fragment set corresponding to the code execution path and the authentication fragment precedes the sensitive operation function, it is determined that the target program does not have an unauthorized vulnerability, and the sensitive operation function includes a function corresponding to a preset operation type.

In the embodiment of the application, after it is determined that the code fragment set contains an authentication fragment, the position of the authentication fragment can be further analyzed. In principle the target program should authenticate the user before a sensitive operation is performed, so if the authentication fragment is located before the sensitive operation function in the target program code, it can be determined that the target program has no unauthorized vulnerability. This meets the stricter detection requirements that apply to high-risk unauthorized vulnerabilities.
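
The position check can be sketched as follows, combining it with the requirement that every code execution path be protected. This is one possible reading of the rule, not the definitive implementation: each path is assumed to be an ordered list of code segments together with the index of its sensitive operation, and `keyword_classifier` is a hypothetical stand-in for the trained authentication classification model.

```python
def path_is_protected(segments, sensitive_index, is_auth_segment):
    """True if a segment located before the sensitive operation is an authentication segment."""
    return any(is_auth_segment(s) for s in segments[:sensitive_index])


def program_is_vulnerable(paths, is_auth_segment):
    """Flag the program if any path reaches its sensitive operation without prior authentication."""
    return any(not path_is_protected(segs, idx, is_auth_segment) for segs, idx in paths)


def keyword_classifier(segment):
    # Stand-in for the authentication classification model (assumption).
    return "checkPermission" in segment


# Each path: (ordered code segments, index of the sensitive operation).
safe_path = (["checkPermission(user);", "deleteUserData(id);"], 1)
late_check_path = (["deleteUserData(id);", "checkPermission(user);"], 0)
```

In `late_check_path` an authentication segment exists, but it comes after the sensitive operation, so the path is still treated as unprotected.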

As still another implementation manner, if the authentication fragments do not exist in the code fragment sets corresponding to all code execution paths corresponding to the target program, determining that the target program has an unauthorized vulnerability.

If none of the code segment sets corresponding to the code execution paths contains an authentication segment, the target program is not provided with an authentication link, and the target program can be considered, to a certain extent, to have an unauthorized vulnerability.

As still another implementation manner, if no authentication fragment exists in the code fragment sets corresponding to all code execution paths of the target program, and at least one other vulnerability detection method also detects an unauthorized vulnerability in the target program, it is determined that the target program has an unauthorized vulnerability. The at least one other vulnerability detection method may be any other method capable of detecting whether an authentication segment exists in the code execution path; if such a method also fails to detect an authentication segment in the code execution path, it may be determined that an unauthorized vulnerability exists in the target program.

In the embodiment of the application, when no authentication fragment exists in the code fragment sets corresponding to all code execution paths, the cause may be that the authentication link cannot be seen by the authentication classification model or at the code level. Therefore other authentication detection modes can be added, for example authentication modes from other domains such as HSF (High-speed Service Framework) authentication, SQL (Structured Query Language) authentication, and gateway authentication. If these modes also find no authentication, the target program can be considered to have an unauthorized vulnerability.
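
The combination with other detection methods can be sketched as below. The function names and the dictionary-based program description are hypothetical; each detector is assumed to return True when it reports an unauthorized vulnerability, for example because no gateway authentication is configured.

```python
def confirm_unauthorized_vulnerability(path_has_auth, other_detectors, program):
    """Confirm a vulnerability only if the model and at least one other detector agree."""
    if any(path_has_auth):
        # This rule applies only when no path contains an authentication segment.
        return False
    return any(detector(program) for detector in other_detectors)


def gateway_detector(program):
    # Hypothetical check: reports a vulnerability when no gateway authentication exists.
    return not program.get("gateway_auth", False)


unprotected = {"gateway_auth": False}
gateway_protected = {"gateway_auth": True}
```

A program whose authentication happens at the gateway is therefore not flagged, even though the model sees no authentication segment at the code level.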

Further, if it is determined that the target program has an unauthorized vulnerability, the training data adopted for training the authentication classification model can be expanded by using the code segment set; wherein the expanding comprises: selecting a code segment from the code segment set as a code segment sample and labeling a non-authentication segment; and/or selecting a code segment from the code segment set, inserting an authentication function into the selected code segment, taking the code segment inserted with the authentication function as a code segment sample, and labeling the authentication segment.

If it is determined that the target program does not have an unauthorized vulnerability, the training data adopted for training the authentication classification model can be expanded by using the code segments identified as authentication segments; wherein the expanding comprises: taking a code segment identified as an authentication segment as a code segment sample and labeling it as an authentication segment; and/or deleting the authentication function from a code segment identified as an authentication segment, taking the result as a code segment sample, and labeling it as a non-authentication segment.

By the expansion mode, the authentication classification model can continuously learn the characteristics of the new authentication fragments and the new non-authentication fragments, so that the classification effect of the authentication classification model is continuously improved, and the accuracy and recall rate of unauthorized vulnerability detection are further improved.
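
The two expansion strategies can be sketched as follows. This is a minimal illustration under simplifying assumptions: the authentication function is represented by a literal snippet of text, insertion is done by prepending the snippet, and deletion is done by string replacement. The function names and the label constants are hypothetical.

```python
AUTH, NON_AUTH = 1, 0


def expand_from_vulnerable(segments, auth_snippet):
    """Segments of a program with an unauthorized vulnerability become non-authentication
    samples; copies with an authentication function inserted become authentication samples."""
    samples = []
    for segment in segments:
        samples.append((segment, NON_AUTH))
        samples.append((auth_snippet + "\n" + segment, AUTH))
    return samples


def expand_from_safe(auth_segments, auth_snippet):
    """Segments identified as authentication segments become authentication samples;
    copies with the authentication function deleted become non-authentication samples."""
    samples = []
    for segment in auth_segments:
        samples.append((segment, AUTH))
        stripped = segment.replace(auth_snippet, "").strip()
        if stripped and stripped != segment:
            samples.append((stripped, NON_AUTH))
    return samples


snippet = "checkPermission(user);"
from_vulnerable = expand_from_vulnerable(["deleteUser(id);"], snippet)
from_safe = expand_from_safe(["checkPermission(user);\ndeleteUser(id);"], snippet)
```

Each original segment yields a matched pair that differs only in the authentication function, which gives the model a direct contrast between the two classes.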

Fig. 4 is a flowchart of a method for vulnerability detection applied to a cloud server according to an embodiment of the present application, where, as shown in fig. 4, the method includes:

Step 401: acquiring the file of the target program uploaded by the user terminal.

In the embodiment of the application, the vulnerability detection device is arranged on the cloud server and is used as a vulnerability detection tool for a developer. The developer can upload the file (for example, the installation file) of the application program to be detected as the file of the target program through the user terminal to the vulnerability detection system in the cloud server.

Step 403: carrying out static analysis on the target program to obtain at least one code execution path, wherein the code execution path comprises calling relations among functions in the code of the target program.

Step 405: performing code segmentation on the code execution path to obtain a code fragment set corresponding to the code execution path.

As one of the realizable modes, code segmentation can be performed on the code execution path to obtain a plurality of code segments; determining the confidence coefficient of a plurality of code segments by utilizing a preset authentication function identification rule; according to the confidence, selecting code segments meeting the preset confidence requirement from a plurality of code segments to form a code segment set corresponding to the code execution path, wherein the code segments contained in the code segment set can be regarded as key code segments corresponding to the code execution path.

Step 407: classifying each code segment contained in the code segment set by using an authentication classification model to obtain a classification result of whether each code segment is an authentication segment, wherein the authentication classification model is obtained by training in advance based on a machine learning model.

Step 409: determining whether the target program has an unauthorized vulnerability according to the classification result.

For the specific implementation of steps 403 to 409, reference may be made to the descriptions of steps 201 to 207 in the embodiment shown in fig. 2, which are not repeated here.

Step 411: sending the information of whether the target program has an unauthorized vulnerability to the user terminal.

As one of the realizable modes, if the target program is identified as having an unauthorized vulnerability, detailed information about the unauthorized vulnerability can be sent to the user terminal together with the information that the vulnerability exists. This information helps the developer understand the cause of the unauthorized vulnerability, so that the target program can be improved in a targeted manner.

In this way, a unified vulnerability detection tool can be provided to all developers on the cloud server side, and developers can conveniently upload the installation files of their applications to the cloud server for vulnerability detection and obtain information about the unauthorized vulnerabilities of those applications.

The method provided by the embodiment of the application can be applied to various scenarios. For example, before a new application or a new version of an application goes online, the application to be released can be taken as the target program, and vulnerability detection can be performed on it using the method flow provided by any of the embodiments to obtain information about whether the target program has an unauthorized vulnerability. If the target program has an unauthorized vulnerability, the developer can modify the application, and the modified application is then checked for unauthorized vulnerabilities again using the method flow provided by any of the embodiments. If the target program has no unauthorized vulnerability, the new application, the new version of the application, or the improved application can be scheduled to go online for users to download and use.

FIG. 5 is a flowchart of a method for model training according to an embodiment of the present application, the method comprising:

Step 501: acquiring training data comprising a plurality of training samples, the training samples comprising: the code segment samples and their corresponding labels, the labels indicating whether the corresponding code segment samples are authentication segments.

As one of the realizable modes, fragments containing authentication functions may be manually constructed as code fragment samples and labeled as authentication fragments, and fragments without authentication functions may be manually constructed as code fragment samples and labeled as non-authentication fragments.

As another implementation manner, an application that has been definitively determined to have an unauthorized vulnerability, either by other detection manners or by the detection manner provided by the embodiment of the present application, may be used. After static analysis is performed on the application to obtain at least one code execution path, code segmentation is performed on the code execution path to obtain a code segment set corresponding to the code execution path; a code segment is selected from the code segment set as a code segment sample and labeled as a non-authentication segment; and/or a code segment is selected from the code segment set, an authentication function is inserted into the selected code segment, and the code segment with the inserted authentication function is taken as a code segment sample and labeled as an authentication segment. The inserted authentication function may be selected at random, and the inserted functions are preferably as varied as possible, so that the authentication classification model can learn rich features for distinguishing authentication segments from non-authentication segments.

As yet another implementation manner, an application that has been definitively determined to have no unauthorized vulnerability, either by other detection manners or by the detection manner provided by the embodiment of the present application, may be used. A code segment identified as an authentication segment is taken as a code segment sample and labeled as an authentication segment; and/or the authentication function is deleted from a code segment identified as an authentication segment, and the result is taken as a code segment sample and labeled as a non-authentication segment.

In addition, other modes may be adopted. For example, after static analysis is performed on the target application to obtain at least one code execution path, code segmentation is performed on the code execution path to obtain a code segment set, and a code segment is then selected from the code segment set by using a preset recognition rule, taken as a code segment sample, and labeled as a non-authentication segment or an authentication segment.

Other ways of obtaining training samples are also possible, not explicitly recited herein.

Step 503: training an authentication classification model by using training data, wherein the authentication classification model comprises a first feature extraction network, a second feature extraction network and a classification network; the first feature extraction network is used for extracting feature representations of all Token from the input code segment samples; the second feature extraction network is used for carrying out rolling and pooling processing on the feature representation of each Token to obtain the feature representation of the input code fragment sample; the classification network is used for obtaining a classification result of whether the code fragment sample is an authentication fragment or not by utilizing the characteristic representation of the input code fragment sample; the training targets include: the difference between the classification result and the label corresponding to the code segment sample is minimized.

The structure and principle of the authentication classification model are not described herein, and reference may be made to fig. 3 and the related description in the embodiment of fig. 3.

In the embodiment of the application, a loss function (such as a cross-entropy loss function) can be constructed according to the training objective, and in each iteration the model parameters are updated by gradient descent using the value of the loss function, until a preset training end condition is met. The training end condition may include, for example, the value of the loss function being less than or equal to a preset loss function threshold, or the number of iterations reaching a preset threshold.
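
The training procedure can be illustrated with a minimal sketch that applies gradient descent to a cross-entropy loss and stops on either end condition. For brevity the model is a single logistic layer over hand-made two-dimensional features rather than the three-network structure of fig. 3; the function `train`, its default hyperparameters, and the toy samples are all assumptions made for illustration.

```python
import math


def train(samples, learning_rate=0.5, loss_threshold=0.05, max_iterations=5000):
    """Gradient descent on binary cross-entropy for a one-layer stand-in classifier."""
    dim = len(samples[0][0])
    weights, bias = [0.0] * dim, 0.0
    n = len(samples)
    loss = float("inf")
    for iteration in range(1, max_iterations + 1):
        grad_w, grad_b, loss = [0.0] * dim, 0.0, 0.0
        for features, label in samples:
            z = sum(w * x for w, x in zip(weights, features)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            p = min(max(p, 1e-12), 1 - 1e-12)  # keep log() finite
            loss -= label * math.log(p) + (1 - label) * math.log(1 - p)
            error = p - label  # derivative of the loss with respect to z
            grad_w = [g + error * x for g, x in zip(grad_w, features)]
            grad_b += error
        loss /= n
        if loss <= loss_threshold:  # end condition 1: loss below the threshold
            break
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    # end condition 2: the loop ends when max_iterations is reached
    return weights, bias, loss, iteration


# Toy samples: (features, label), label 1 = authentication segment.
# Hypothetical features: (count of auth keywords, count of logging keywords).
samples = [([2.0, 0.0], 1), ([1.0, 0.0], 1), ([0.0, 1.0], 0), ([0.0, 2.0], 0)]
weights, bias, loss, iteration = train(samples)
```

On this linearly separable toy data the loss threshold is reached well before the iteration limit, so the first end condition terminates training.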

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

According to an embodiment of another aspect, a vulnerability detection apparatus is provided. FIG. 6 illustrates a schematic block diagram of a vulnerability detection apparatus, according to one embodiment. The apparatus 600 includes: a static analysis unit 601, a code segmentation unit 603, an authentication classification unit 605 and a vulnerability identification unit 607. Wherein the main functions of each constituent unit are as follows:

The static analysis unit 601 is configured to perform static analysis on the target program to obtain at least one code execution path, where the code execution path includes a call relationship between functions in the code of the target program;

The code segmentation unit 603 is configured to perform code segmentation on the code execution path to obtain a code segment set corresponding to the code execution path;

An authentication classification unit 605 configured to classify each code segment included in the code segment set by using an authentication classification model, so as to obtain a classification result of whether each code segment is an authentication segment, wherein the authentication classification model is obtained by training in advance based on a machine learning model;

The vulnerability identification unit 607 is configured to determine whether the target program has an unauthorized vulnerability according to the classification result.

As an implementation manner in the embodiment of the present application, the static analysis unit 601 may be specifically configured to: performing static analysis on the target program to obtain an entry function and a sensitive operation function of the target program; at least one code execution path from an entry function of the target program to a sensitive operation function is constructed, wherein the sensitive operation function comprises a function corresponding to a preset operation type.

As an implementation manner in the embodiment of the present application, the code slicing unit 603 may be specifically configured to: code segmentation is carried out on the code execution path to obtain a plurality of code fragments; determining the confidence coefficient of a plurality of code segments by utilizing a preset authentication function identification rule; and selecting code fragments meeting the preset confidence requirements from the plurality of code fragments according to the confidence, and forming a code fragment set corresponding to the code execution path.

As an implementation manner in the embodiment of the present application, the code segmentation unit 603 may be specifically configured to: segment the code execution path at a granularity smaller than or equal to that of a function to obtain a plurality of candidate code segments; and screen the plurality of candidate code segments based on preset authentication features to obtain a plurality of code segments.

As an implementation manner in the embodiment of the present application, the vulnerability identification unit 607 may be specifically configured to: if an authentication fragment exists in the code fragment set corresponding to the code execution path, determine that the target program does not have an unauthorized vulnerability; or if an authentication fragment exists in the code fragment set corresponding to the code execution path and the authentication fragment precedes the sensitive operation function, determine that the target program does not have an unauthorized vulnerability, wherein the sensitive operation function comprises a function corresponding to a preset operation type; or if no authentication fragment exists in the code fragment sets corresponding to all code execution paths of the target program, determine that the target program has an unauthorized vulnerability; or if no authentication fragment exists in the code fragment sets corresponding to all code execution paths of the target program, and at least one other vulnerability detection method detects an unauthorized vulnerability in the target program, determine that the target program has an unauthorized vulnerability.

As an implementation manner in the embodiment of the present application, the apparatus may further include a sample expansion unit (not shown in the figure), which may be configured to: if it is determined that the target program has an unauthorized vulnerability, expand the training data adopted for training the authentication classification model by using the code segment set. Wherein the expanding comprises: selecting a code segment from the code segment set as a code segment sample and labeling it as a non-authentication segment; and/or selecting a code segment from the code segment set, inserting an authentication function into the selected code segment, taking the code segment with the inserted authentication function as a code segment sample, and labeling it as an authentication segment.

As another implementation in an embodiment of the present application, the sample expansion unit may be configured to: if it is determined that the target program has no unauthorized vulnerability, expand the training data adopted for training the authentication classification model by using the code segments identified as authentication segments; wherein the expanding comprises: taking a code segment identified as an authentication segment as a code segment sample and labeling it as an authentication segment; and/or deleting the authentication function from a code segment identified as an authentication segment, taking the result as a code segment sample, and labeling it as a non-authentication segment.

It should be noted that the embodiment of the present application further provides a vulnerability detection device applied to a cloud server (not shown in the figure). Compared with the vulnerability detection device in FIG. 6, this device has two additional units, an acquiring unit and a sending unit; the other units are identical to those in FIG. 6 and are not described again here.

The acquiring unit is configured to acquire the file of the target program uploaded by the user terminal, and then provide the file to the static analysis unit 601 for static analysis.

The sending unit is configured to send, to the user terminal, the information obtained by the vulnerability identification unit 607 about whether the target program has an unauthorized vulnerability. The functions of these units and the manner in which they are implemented are as described above.

According to an embodiment of another aspect, a model training apparatus is provided. FIG. 7 shows a schematic block diagram of the model training apparatus according to one embodiment. The apparatus 700 includes:

A sample acquisition unit 701 configured to acquire training data including a plurality of training samples, the training samples including: the code segment samples and their corresponding labels, the labels indicating whether the corresponding code segment samples are authentication segments.

A model training unit 703 configured to train an authentication classification model using the training data, wherein the authentication classification model comprises a first feature extraction network, a second feature extraction network, and a classification network; the first feature extraction network is used for extracting a feature representation of each token from the input code segment sample; the second feature extraction network is used for performing convolution and pooling processing on the feature representations of the tokens to obtain a feature representation of the input code segment sample; the classification network is used for obtaining, from the feature representation of the input code segment sample, a classification result of whether the code segment sample is an authentication segment; the training target includes: minimizing the difference between the classification result and the label corresponding to the code segment sample.
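The three-part model structure can be made concrete with a small forward-pass sketch in plain Python. Everything specific here is an assumption for illustration: a random embedding table stands in for the first feature extraction network (whose architecture is not specified in this passage), the vocabulary and dimensions are toy values, max pooling is chosen as the pooling operation, and binary cross-entropy is used as one possible measure of the difference between the classification result and the label.

```python
import math
import random
from typing import List

random.seed(0)

# Toy vocabulary and sizes (hypothetical values).
VOCAB = {"<unk>": 0, "check_permission": 1, "db": 2, "delete": 3, "uid": 4}
EMB_DIM, NUM_FILTERS, KERNEL = 4, 3, 2


def rand_matrix(rows: int, cols: int) -> List[List[float]]:
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Stand-in for the first feature extraction network: one vector per token.
embedding = rand_matrix(len(VOCAB), EMB_DIM)
# Second feature extraction network: convolution kernels over token windows.
conv_filters = [rand_matrix(KERNEL, EMB_DIM) for _ in range(NUM_FILTERS)]
# Classification network: a single logistic unit.
clf_weights = [random.uniform(-0.5, 0.5) for _ in range(NUM_FILTERS)]
clf_bias = 0.0


def token_features(tokens: List[str]) -> List[List[float]]:
    feats = [embedding[VOCAB.get(t, 0)] for t in tokens]
    while len(feats) < KERNEL:  # pad short inputs with zero vectors
        feats.append([0.0] * EMB_DIM)
    return feats


def conv_and_pool(feats: List[List[float]]) -> List[float]:
    """1-D convolution over token positions, then max pooling per filter."""
    pooled = []
    for filt in conv_filters:
        responses = []
        for start in range(len(feats) - KERNEL + 1):
            window = feats[start:start + KERNEL]
            responses.append(sum(w * x
                                 for w_row, x_row in zip(filt, window)
                                 for w, x in zip(w_row, x_row)))
        pooled.append(max(responses))
    return pooled


def predict(tokens: List[str]) -> float:
    """Probability that the code segment is an authentication segment."""
    segment_feat = conv_and_pool(token_features(tokens))
    z = sum(w * x for w, x in zip(clf_weights, segment_feat)) + clf_bias
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(p: float, label: int) -> float:
    """Binary cross-entropy between a prediction and a 0/1 label."""
    eps = 1e-12
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))
```

Training would adjust the embedding, kernel, and classifier parameters to reduce `bce_loss` over the labeled code segment samples; the gradient update itself is omitted here.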

As an implementation manner in the embodiment of the present application, the sample acquisition unit 701 may be specifically configured to acquire the training data comprising a plurality of training samples by:

Acquiring an application program determined to have an unauthorized vulnerability;

Performing static analysis on the application program to obtain at least one code execution path, and then performing code segmentation on the code execution path to obtain a code segment set corresponding to the code execution path;

Selecting a code segment from the code segment set as a code segment sample and labeling it as a non-authentication segment; and/or selecting a code segment from the code segment set, inserting an authentication function into the selected code segment, taking the code segment with the inserted authentication function as a code segment sample, and labeling it as an authentication segment.

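As another implementation manner in the embodiment of the present application, the sample acquisition unit 701 may be specifically configured to: acquire a code segment determined to be an authentication segment as a code segment sample and label it as an authentication segment; and/or delete the authentication function from a code segment determined to be an authentication segment, take the resulting code segment as a code segment sample, and label it as a non-authentication segment.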

In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the device embodiments are substantially similar to the method embodiments, they are described relatively briefly, and reference is made to the description of the method embodiments for relevant points. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.

It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.

In addition, the embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, implements the steps of the method of any one of the previous method embodiments.

The embodiment of the application also provides an electronic device, comprising:

One or more processors; and

A memory associated with the one or more processors for storing program instructions that, when read for execution by the one or more processors, perform the steps of the method of any of the preceding method embodiments.

The application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of any of the preceding method embodiments.

FIG. 8 illustrates an architecture of an electronic device, which may include, inter alia, a processor 810, a video display adapter 811, a disk drive 812, an input/output interface 813, a network interface 814, and a memory 820. The processor 810, video display adapter 811, disk drive 812, input/output interface 813, network interface 814, and memory 820 may be communicatively coupled via a communication bus 830.

The processor 810 may be implemented by a general-purpose CPU, a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, etc., for executing related programs to implement the technical solution provided by the present application.

The memory 820 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage, dynamic storage, etc. The memory 820 may store an operating system 821 for controlling the operation of the electronic device 800, and a Basic Input Output System (BIOS) 822 for controlling the low-level operation of the electronic device 800. In addition, a web browser 823, a data storage management system 824, a vulnerability detection device 825, and the like may also be stored. The vulnerability detection device 825 may be an application program that specifically implements the operations of the foregoing steps in the embodiment of the present application. In general, when implemented in software or firmware, the relevant program code is stored in memory 820 and executed by processor 810.

The input/output interface 813 is used to connect with an input/output module to realize information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.

The network interface 814 is used to connect a communication module (not shown) to enable communication between the present device and other devices. The communication module may communicate in a wired manner (such as USB, network cable, etc.) or in a wireless manner (such as mobile network, Wi-Fi, Bluetooth, etc.).

Bus 830 includes a path for transferring information between components of the device (e.g., processor 810, video display adapter 811, disk drive 812, input/output interface 813, network interface 814, and memory 820).

It is noted that although the above-described devices illustrate only the processor 810, video display adapter 811, disk drive 812, input/output interface 813, network interface 814, memory 820, bus 830, etc., the device may include other components necessary to achieve proper operation in an implementation. Furthermore, it will be appreciated by those skilled in the art that the apparatus may include only the components necessary to implement the present application, and not all of the components shown in the drawings.

From the above description of embodiments, it will be apparent to those skilled in the art that the present application may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a computer program product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.

The application has been described in detail above, and specific examples have been used to explain its principles and embodiments; the description of the embodiments is intended only to aid understanding of the method of the application and its core idea. Those of ordinary skill in the art may, in light of the ideas of the present application, make changes to the specific embodiments and the scope of application. In view of the foregoing, this description should not be construed as limiting the application.

Claims

1. A vulnerability detection method, the method comprising:

2. The method of claim 1, wherein performing static analysis on the target program to obtain more than one code execution path comprises:

3. The method of claim 1, wherein performing code segmentation on the code execution path to obtain a set of code segments corresponding to the code execution path comprises:

4. The method of claim 3, wherein code slicing the code execution path to obtain a plurality of code segments comprises:

5. The method of claim 1, wherein the authentication classification model comprises: a first feature extraction network, a second feature extraction network, and a classification network;

6. The method of claim 1 or 5, further comprising at least one of:

7. The method of any of claims 1 to 5, wherein determining whether the target program has an unauthorized vulnerability based on the classification result comprises:

8. A vulnerability detection method applied to a cloud server, the method comprising:

Acquiring a file of a target program uploaded by a user terminal;

9. A method of model training, the method comprising:

The first feature extraction network is used for extracting feature representations of each element Token from the input code segment samples;

10. The method of claim 9, wherein the obtaining training data comprising a plurality of training samples comprises at least one of:

11. A vulnerability detection apparatus, the apparatus comprising:

12. A model training apparatus, the apparatus comprising:

13. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method of any of claims 1 to 10.

14. An electronic device, comprising:

One or more processors; and

A memory associated with the one or more processors for storing program instructions that, when read for execution by the one or more processors, perform the steps of the method of any of claims 1 to 10.