CN116663022A

CN116663022A - Scene threat modeling method based on multi-library fusion

Info

Publication number: CN116663022A
Application number: CN202310962171.1A
Authority: CN
Inventors: 谢朝海; 齐大伟; 谢朝战; 雷德诚; 李志勇
Original assignee: Shenzhen Secidea Network Security Technology Co ltd
Current assignee: Shenzhen Secidea Network Security Technology Co ltd
Priority date: 2023-08-02
Filing date: 2023-08-02
Publication date: 2023-08-29
Anticipated expiration: 2043-08-02
Also published as: CN116663022B

Abstract

The invention discloses a scene threat modeling method based on multi-library fusion, relating to the field of computer security. A threat model library is constructed that contains a plurality of threat models, each composed of knowledge base, vulnerability library and compliance library information; each threat model has characteristic parameters that reflect basic attributes such as business scenario, threat level and attack mode. A machine learning model is then trained: a prediction score is generated from the similarity between the sample characteristics and the threat model characteristic parameters, and the model parameters are adjusted to reduce the error between the prediction score and the actual score. When the threat model most suitable for the current business environment needs to be selected, the machine learning model receives as input the characteristics obtained by analyzing the business source code and data stream together with a threat model from the threat model library, outputs a prediction score, and the optimal threat model is selected according to the scores. The method improves the convenience and efficiency of threat modeling and strengthens the capability and flexibility of coping with threats.

Description

Scene threat modeling method based on multi-library fusion

Technical Field

The invention relates to the field of computer security, in particular to a scene threat modeling method based on multi-library fusion.

Background

Threat modeling is a key element of information security; its goal is to identify potential threats and the impact these threats may have on a system. The threat modeling techniques currently available mainly include the following:

1) Attack tree model: the attack tree model is a structured method that shows the steps and paths of an attack through a tree structure. It cannot respond to dynamic environments and does not account for the dependencies between attack paths, so its use in complex real-world environments may be limited.

2) STRIDE model: STRIDE is a threat modeling method developed by Microsoft that covers six types of threats: spoofing, tampering, repudiation, information disclosure, denial of service, and elevation of privilege. The STRIDE model considers these threats systematically, but it requires significant human involvement and is costly and time-consuming.

3) Security Cards: Security Cards is a team-based approach that models threats by facilitating team discussion and thinking about potential threats. However, its effectiveness depends largely on the security knowledge and experience of the team members, so the results carry some uncertainty.

4) Automated threat modeling tools: some tools, such as the Microsoft Threat Modeling Tool, can generate threat models automatically, but they often require detailed system information as input, which is difficult to provide for complex systems or for systems about which information is incomplete.

5) Data-driven threat modeling: in recent years, with the development of big data and artificial intelligence, data-driven threat modeling approaches have gradually been explored. These approaches extract potential threat patterns by analyzing large amounts of security event data. However, they typically require a large amount of labeled data and may not be effective against new, unknown threats.

From the above examples, it can be seen that although current threat modeling technology has made some progress, challenges remain in adapting to dynamic environments, modeling dependencies, reducing human input, acquiring system information, and identifying new threats. A threat modeling approach that is more automated, more intelligent, and able to cope with varied scenarios is therefore needed.

Disclosure of Invention

The invention aims to provide a scene threat modeling method based on multi-library fusion so as to solve the problems in the background art.

In order to achieve the above purpose, the invention adopts the following technical scheme:

A scene threat modeling method based on multi-library fusion comprises the following steps:

S1: constructing a threat model library, wherein the threat model library comprises a plurality of threat models, each threat model consists of corresponding knowledge base, vulnerability library and compliance library information and is provided with corresponding characteristic parameters, and the characteristic parameters reflect basic attributes of the threat model, including business scenario, threat level and possible attack modes;

S2: training a machine learning model, wherein the training process of the machine learning model comprises: randomly selecting a threat model from the threat model library, selecting a sample from the service source code and data stream in a training set for analysis, inputting the characteristics of the sample and the characteristic parameters of the threat model into the machine learning model for comparison, generating a prediction score by calculating the similarity between the characteristics of the sample and the characteristic parameters of the threat model, comparing the prediction score with the actual score provided by experts or historical data to obtain an error, and adjusting the model parameters by back propagation with the aim of reducing the error;

S3: when an optimal threat model needs to be selected from the threat model library according to the actual service source code and data stream, the machine learning model receives the characteristics obtained by analyzing the actual service source code and data stream as one input, takes one threat model selected from the threat model library as the other input, and outputs a prediction score; the prediction scores corresponding to all threat models in the threat model library are compared, and the one or more threat models with the highest scores are taken, alone or in combination, as the threat model most suitable for the current business environment.

Preferably, in the machine learning model training process, a neural network is used for model training.

Preferably, extracting characteristics from the service source code and the data stream in S2 and S3 includes:

analyzing the service source code by using a static code analysis tool, and extracting code characteristics from the service source code, wherein the code characteristics at least comprise: code structure, function calls, API use, and data structures;

using a dynamic analysis tool to monitor and capture the service data stream in real time, and extracting data stream characteristics, wherein the data stream characteristics at least comprise: transmission protocol, data format, data size, and data transmission frequency;

feature-encoding the extracted code characteristics and data stream characteristics, so that they are converted into feature vectors that are easy to calculate and process.
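By way of non-limiting illustration, the static extraction and encoding of code characteristics might be sketched as follows. The sketch assumes the service source code is Python and uses the standard `ast` module as the static analysis tool; the tracked call vocabulary `TRACKED_CALLS`, the function names, and the vector layout are hypothetical and are not specified by the method.

```python
import ast
from collections import Counter

# Hypothetical vocabulary of call names the encoder tracks (an assumption).
TRACKED_CALLS = ["execute", "open", "eval", "get", "post"]

def extract_code_features(source):
    """Collect simple structural features from Python source text."""
    tree = ast.parse(source)
    calls = Counter()
    n_functions = 0
    n_classes = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            n_functions += 1
        elif isinstance(node, ast.ClassDef):
            n_classes += 1
        elif isinstance(node, ast.Call):
            func = node.func
            # ast.Name covers f(...); ast.Attribute covers obj.f(...).
            if isinstance(func, ast.Name):
                calls[func.id] += 1
            elif isinstance(func, ast.Attribute):
                calls[func.attr] += 1
    return {"functions": n_functions, "classes": n_classes, "calls": calls}

def encode_code_features(features):
    """Layout: [function count, class count, one count per tracked call]."""
    calls = features["calls"]
    head = [float(features["functions"]), float(features["classes"])]
    return head + [float(calls[name]) for name in TRACKED_CALLS]

SAMPLE = '''
def login(cursor, user):
    cursor.execute("SELECT * FROM users WHERE name = '" + user + "'")
    return cursor.fetchone()
'''

vector = encode_code_features(extract_code_features(SAMPLE))
```

A fixed vocabulary keeps every vector the same length, which the similarity calculation requires; a production encoder would likely use a richer representation.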

Preferably, the method for calculating the similarity between the sample characteristics and the threat model characteristic parameters is either of the following: a cosine similarity calculation method or a Euclidean distance calculation method.
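As a minimal sketch of the two similarity calculations, assuming both inputs are equal-length numeric feature vectors: the handling of zero vectors and the mapping of Euclidean distance to a similarity via 1 / (1 + d) are assumptions made for illustration, not part of the method as stated.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between vectors a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def euclidean_similarity(a, b):
    """Map distance to (0, 1]; the 1 / (1 + d) mapping is an assumption."""
    return 1.0 / (1.0 + euclidean_distance(a, b))
```

Cosine similarity ignores vector magnitude, whereas Euclidean distance is sensitive to it, so the two may rank threat models differently when feature counts vary widely.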

Preferably, in S3, the order in which threat models are selected from the threat model library to be input into the machine learning model is based on a specific policy.

Preferably, the policy includes preferentially selecting threat models that performed better in historical cases, or preferentially selecting threat models newly added to the library.
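The two ordering policies might be sketched as follows. The `ThreatModelEntry` record and its fields `historical_score` and `added_order` are hypothetical; the method does not specify how historical performance or addition time is stored.

```python
from dataclasses import dataclass

@dataclass
class ThreatModelEntry:
    name: str
    historical_score: float  # hypothetical: mean score in past cases
    added_order: int         # hypothetical: larger means added more recently

def order_by_history(models):
    """Models that performed better in historical cases come first."""
    return sorted(models, key=lambda m: m.historical_score, reverse=True)

def order_by_recency(models):
    """Models most recently added to the library come first."""
    return sorted(models, key=lambda m: m.added_order, reverse=True)

library = [
    ThreatModelEntry("XSS", 0.6, 2),
    ThreatModelEntry("SQL injection", 0.8, 1),
    ThreatModelEntry("Zero-day", 0.5, 3),
]
history_first = [m.name for m in order_by_history(library)]
newest_first = [m.name for m in order_by_recency(library)]
```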

Preferably, when a new threat model is added to the threat model library, the machine learning model is trained by adopting an incremental learning method.

The incremental learning process comprises the following steps:

S2.1: when a new threat model is added to the threat model library, extracting the characteristic parameters of the new threat model and adding them to the training set;

S2.2: inputting the new training set into the already trained machine learning model, and carrying out forward propagation through the existing model to obtain a prediction score;

S2.3: comparing the prediction score with the actual score provided by experts or historical data to obtain an error, and performing back propagation with the aim of reducing the error to adjust the model parameters;

S2.4: repeating steps S2.2 and S2.3 until the update amplitude of the model parameters is smaller than a preset threshold or the preset number of training iterations is reached, at which point the incremental learning process ends.
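Steps S2.1 to S2.4 can be illustrated with the following sketch. It assumes a single-layer sigmoid model standing in for the neural network, a squared-error loss, and full-batch gradient descent; the learning rate, tolerance, round limit, and example data are hypothetical.

```python
import math

def predict(weights, bias, x):
    """Forward propagation: weighted sum squashed into (0, 1)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def incremental_learning(weights, bias, training_set, new_examples,
                         lr=0.5, tol=1e-4, max_rounds=10000):
    # S2.1: add examples derived from the new threat model to the training set.
    data = training_set + new_examples
    weights = list(weights)
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, actual in data:
            pred = predict(weights, bias, x)               # S2.2: forward pass
            delta = (pred - actual) * pred * (1.0 - pred)  # S2.3: error gradient
            for i, xi in enumerate(x):
                grad_w[i] += delta * xi
            grad_b += delta
        step_w = [lr * g / len(data) for g in grad_w]
        step_b = lr * grad_b / len(data)
        weights = [w - s for w, s in zip(weights, step_w)]
        bias -= step_b
        # S2.4: stop once the largest parameter update is below the threshold.
        if max(abs(s) for s in step_w + [step_b]) < tol:
            break
    return weights, bias, rounds

# Hypothetical already-trained parameters and data: (feature vector, actual score).
old_weights, old_bias = [3.0, 0.0], -0.85
old_data = [([1.0, 0.0], 0.9), ([0.0, 0.0], 0.3)]
new_data = [([0.0, 1.0], 0.8)]  # derived from the newly added threat model
new_weights, new_bias, rounds_used = incremental_learning(
    old_weights, old_bias, old_data, new_data)
```

Starting from the existing parameters rather than re-initializing is what makes the update incremental; retaining the old examples in the batch limits drift on previously learned threat models.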

Compared with the prior art, the invention has the following advantages:

1. By constructing the threat model library, the invention collects and integrates various threat models in advance, including the knowledge base, vulnerability library and compliance library information associated with them. This allows the user to pick existing models directly from the library when facing a specific business scenario, without having to build models from scratch, which greatly enhances the convenience of operation.

2. The introduction of the machine learning model further improves the threat modeling efficiency. By training the machine learning model, the effects of various threat models under specific business scenes can be automatically evaluated and compared, and complex and error-prone comparison work is not needed manually.

3. The invention can automatically select the threat model most suitable for the current business environment according to the real-time service source code and data stream conditions. This allows users to make real-time threat assessments and to discover and address possible threats in a more timely manner.

4. Through the prediction scores of the machine learning model, a user can see directly how each threat model performs in the current business environment and how the models compare with one another, and can therefore select and use threat models in a more targeted way.

5. When a new threat model is added, the machine learning model can be updated and optimized quickly by means of incremental learning so that it can better cope with the new threat model. This allows the method of the present invention to remain efficient and effective in the face of increasingly complex and diverse threat environments.

In conclusion, the method not only improves the convenience and efficiency of threat modeling, but also enhances the capability and flexibility of handling various threats.

Drawings

FIG. 1 is a general schematic of the process of the present invention;

FIG. 2 is a schematic diagram of step S2 in the method of the present invention;

FIG. 3 is a schematic representation of step S3 in the process of the present invention.

Description of the embodiments

The following describes specific embodiments of the present invention with reference to the drawings.

As shown in FIG. 1, which is a general flow chart of the method of the present invention, the method generally comprises the following steps:

S1: constructing a threat model library, wherein the threat model library comprises a plurality of threat models, each threat model consists of corresponding knowledge base, vulnerability library and compliance library information and is provided with corresponding characteristic parameters, and the characteristic parameters reflect the basic attributes of the threat model, including business scenario, threat level and possible attack modes;

S2: training a machine learning model, wherein the training process of the machine learning model comprises: a threat model is randomly selected from the threat model library, a sample is selected from the service source code and data stream in a training set for analysis, the characteristics of the sample and the characteristic parameters of the threat model are input into the machine learning model for comparison, a prediction score is generated by calculating the similarity between the characteristics of the sample and the characteristic parameters of the threat model, the prediction score is compared with the actual score provided by experts or historical data to obtain an error, and the model parameters are adjusted by back propagation with the aim of reducing the error. This step is shown in FIG. 2.

S3: when an optimal threat model needs to be selected from the threat model library according to the actual service source code and data stream, the machine learning model receives the characteristics obtained by analyzing the actual service source code and data stream as one input, takes one threat model selected from the threat model library as the other input, and outputs a prediction score; the prediction scores corresponding to all threat models in the threat model library are compared, and the one or more threat models with the highest scores are taken, alone or in combination, as the threat model most suitable for the current business environment. This step is shown in FIG. 3.

In some embodiments, a neural network is used for model training in the machine learning model training process.

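In some embodiments, the process of extracting characteristics from the service source code and data stream in S2 and S3 includes:

analyzing the service source code by using a static code analysis tool, and extracting code characteristics from the service source code, wherein the code characteristics at least comprise: code structure, function calls, API use, and data structures;

using a dynamic analysis tool to monitor and capture the service data stream in real time, and extracting data stream characteristics, wherein the data stream characteristics at least comprise: transmission protocol, data format, data size, and data transmission frequency;

feature-encoding the extracted code characteristics and data stream characteristics, so that they are converted into feature vectors that are easy to calculate and process.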

In some embodiments, the method of calculating the similarity between the sample characteristics and the threat model characteristic parameters is either of the following: a cosine similarity calculation method or a Euclidean distance calculation method.

In some embodiments, the order in which threat models are selected from a threat model library to be input into the machine learning model is based on a particular policy.

In some embodiments, the policy includes preferentially selecting threat models that performed better in historical cases, or preferentially selecting threat models newly added to the library.

In some embodiments, the machine learning model is trained using an incremental learning approach when a new threat model is added to the threat model library.

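In some embodiments, the process of incremental learning includes:

S2.1: when a new threat model is added to the threat model library, extracting the characteristic parameters of the new threat model and adding them to the training set;

S2.2: inputting the new training set into the already trained machine learning model, and carrying out forward propagation through the existing model to obtain a prediction score;

S2.3: comparing the prediction score with the actual score provided by experts or historical data to obtain an error, and performing back propagation with the aim of reducing the error to adjust the model parameters;

S2.4: repeating steps S2.2 and S2.3 until the update amplitude of the model parameters is smaller than a preset threshold or the preset number of training iterations is reached, at which point the incremental learning process ends.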

In the following embodiments, the present invention will be explained by taking the development of an e-commerce platform as an example.

Suppose we need to develop an e-commerce platform whose service source code and data stream contain a large amount of information, such as user accounts, account passwords, transaction records, commodity information, user shopping cart information, and bank card information. As the business develops, the e-commerce platform may constantly face new security threats, so suitable threat modeling methods need to be selected to prevent attacks.

First, a threat model library is built and populated with specific threat models; a threat model in the library can then be selected for direct use or as a modeling reference.

The threat model library may include the following models:

SQL injection attack model: a common way of attacking databases, in which an attacker tries to control or destroy the database by injecting malicious SQL code.

Cross-site scripting (XSS) attack model: such attacks typically occur in interactive features of websites such as form submission, where an attacker attempts to affect other users by submitting malicious JavaScript code.

Distributed denial of service (DDoS) attack model: an attacker overwhelms the target server with a massive number of requests so that it cannot provide normal service.

Traditional password cracking model: attempts to crack a user's password by repeated guessing or dictionary attacks.

Zero-day vulnerability attack model: attacks against system vulnerabilities that have not yet been disclosed or patched.

Internal malicious behavior model: malicious activities from internal employees, such as information theft, data tampering, etc.

For each threat model, we set a series of characteristic parameters including, but not limited to:

Expected business scenario: such as login interfaces, transfer pages, registration interfaces, forum posts, and back-end management systems.

Threat level: such as severe, high, medium, or low. A severe threat may directly cause system crash or data loss, a high threat may cause user data leakage, and a medium or low threat may cause partial loss of function or degraded user experience.

Possible attack modes: such as social engineering attacks (attacks that exploit human weaknesses), phishing attacks (luring users into clicking malicious links or downloading malware), and man-in-the-middle attacks (intercepting and tampering with information during communication).

Each characteristic parameter is encoded into a feature vector and input into a machine learning model for threat prediction and defense.
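As one possible illustration of this encoding, the sketch below uses multi-hot encoding for business scenarios and attack modes and an ordinal value for the threat level. The vocabularies, the numeric level values, and the example "password cracking" parameters are assumptions made for illustration only.

```python
# Hypothetical vocabularies; real ones would come from the threat model library.
SCENARIOS = ["login", "transfer", "registration", "forum_post", "admin_backend"]
LEVELS = {"low": 0.25, "medium": 0.5, "high": 0.75, "severe": 1.0}
ATTACK_MODES = ["social_engineering", "phishing", "man_in_the_middle"]

def encode_threat_model(scenarios, level, attack_modes):
    """Layout: [scenario flags..., threat level, attack mode flags...]."""
    vec = [1.0 if s in scenarios else 0.0 for s in SCENARIOS]
    vec.append(LEVELS[level])
    vec += [1.0 if m in attack_modes else 0.0 for m in ATTACK_MODES]
    return vec

# Hypothetical characteristic parameters for a password cracking model.
password_cracking = encode_threat_model({"login"}, "high", {"phishing"})
```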

Training of the machine learning model is performed next:

First, training data is prepared, which typically consists of features extracted from the service source code and data stream. These include source code characteristics (e.g., code structure, function calls, API use, data structures) extracted by the static code analysis tool and data stream characteristics (e.g., transmission protocol, data format, data size, data transmission frequency) extracted by the dynamic analysis tool. These characteristics are encoded as feature vectors, forming the training data set.
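A minimal sketch of encoding the data stream characteristics named above is given below. The protocol and format vocabularies and the logarithmic scaling of size and frequency are assumptions; the method only requires that the characteristics be converted into feature vectors.

```python
import math

# Hypothetical vocabularies for one-hot encoding.
PROTOCOLS = ["http", "https", "tcp", "udp"]
FORMATS = ["json", "xml", "binary"]

def encode_stream_features(protocol, data_format, size_bytes, requests_per_second):
    """Layout: [protocol one-hot, format one-hot, log size, log frequency]."""
    vec = [1.0 if protocol == p else 0.0 for p in PROTOCOLS]
    vec += [1.0 if data_format == f else 0.0 for f in FORMATS]
    # Log scaling keeps large sizes and rates from dominating the vector.
    vec.append(math.log10(1 + size_bytes))
    vec.append(math.log10(1 + requests_per_second))
    return vec

stream_vector = encode_stream_features("https", "json", 999, 9)
```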

Next, a threat model is randomly selected from the threat model library, and its characteristic parameters are extracted and likewise encoded as a feature vector. The two feature vectors are then input into the machine learning model.

The machine learning model calculates the similarity between the input feature vector and the threat model feature vector to obtain a predictive score, compares the predictive score with the actual score provided by the expert or historical data, and derives an error value based on the comparison.

The error value is used in back propagation to adjust the parameters of the machine learning model so that its next prediction is closer to the actual score. This process is repeated until the error between the prediction score and the actual score is less than a predetermined threshold, or a predetermined number of training iterations is reached, at which point training of the machine learning model is complete.
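The training loop described above might look like the following sketch. It assumes a single-layer weighted-similarity model with a sigmoid output as a stand-in for the neural network, a squared-error loss, and stochastic gradient descent; the class `SimilarityScorer`, the toy vectors, and the actual scores are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class SimilarityScorer:
    """Weighted similarity between a sample vector and a threat model vector."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        self.b = 0.0

    def predict(self, sample, threat):
        # Weighted dot product of the two vectors, squashed into (0, 1).
        z = sum(w * s * t for w, s, t in zip(self.w, sample, threat)) + self.b
        return sigmoid(z)

    def train_step(self, sample, threat, actual, lr=0.5):
        pred = self.predict(sample, threat)
        error = pred - actual
        # Gradient of 0.5 * error**2 with respect to z, via the sigmoid derivative.
        grad_z = error * pred * (1.0 - pred)
        for i, (s, t) in enumerate(zip(sample, threat)):
            self.w[i] -= lr * grad_z * s * t
        self.b -= lr * grad_z

def mean_loss(scorer, threats, examples):
    return sum(0.5 * (scorer.predict(s, threats[name]) - y) ** 2
               for s, name, y in examples) / len(examples)

def train(scorer, threats, examples, threshold=1e-4, max_steps=20000, seed=0):
    rng = random.Random(seed)
    step = 0
    for step in range(1, max_steps + 1):
        # Randomly pick a labeled (sample, threat model, actual score) triple.
        sample, name, actual = rng.choice(examples)
        scorer.train_step(sample, threats[name], actual)
        # Stop early once the mean error falls below the threshold.
        if step % 100 == 0 and mean_loss(scorer, threats, examples) < threshold:
            break
    return step

THREATS = {"sql_injection": [1.0, 0.0, 1.0], "xss": [0.0, 1.0, 1.0]}
EXAMPLES = [
    ([1.0, 0.0, 1.0], "sql_injection", 0.9),
    ([1.0, 0.0, 1.0], "xss", 0.3),
    ([0.0, 1.0, 1.0], "xss", 0.9),
    ([0.0, 1.0, 1.0], "sql_injection", 0.3),
]
scorer = SimilarityScorer(dim=3)
initial_loss = mean_loss(scorer, THREATS, EXAMPLES)
steps_used = train(scorer, THREATS, EXAMPLES)
final_loss = mean_loss(scorer, THREATS, EXAMPLES)
```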

When a new threat model is added into the threat model library, the characteristic parameters of the new threat model can be extracted by using an incremental learning method and added into a training data set, and then the previous training process is repeated, so that the machine learning model is updated, and the machine learning model can adapt to the new threat model.

In this way, our machine learning model is able to adapt quickly and provide accurate threat predictions, whether in the face of a known threat model or a new threat model.

In the above process, the phrase "actual score provided by experts or historical data" requires some explanation.

For "expert provided actual scoring," this typically means that experienced security experts manually evaluate given business scenarios and threat models, giving them a score that they deem appropriate. Such scoring is typically based on the knowledge and experience of the expert.

As for the "actual score provided by historical data", this is typically based on past events and their results. For example, if a particular threat model has in the past led to serious consequences in a similar business scenario, then the score in this case may be higher. These historical scores can provide learning objectives for machine learning models, enabling the models to better predict what may be encountered in the future.

More precisely, the way the actual score is calculated depends on how "effectiveness" is defined. If effectiveness is defined as the success rate of attack prevention, the actual score may be the number of successfully prevented attacks divided by the total number of attacks; if it is defined as the degree of loss reduction, the actual score may be the loss prevented divided by the total possible loss. The actual scores provided by experts or historical data may be calculated according to these definitions. Other calculation methods are also possible, and an expert may apply their own judgment standard; as long as that standard is fixed, the learning target is determined, and the prediction score tends toward the actual score as the model parameters change during training.
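The two example definitions of the actual score can be written out as follows. The function names and the treatment of a zero denominator are assumptions added for illustration.

```python
def prevention_success_rate(prevented_attacks, total_attacks):
    """Actual score as prevented attacks divided by total attacks."""
    if total_attacks == 0:
        return 0.0  # assumption: no attacks observed means no evidence
    return prevented_attacks / total_attacks

def loss_reduction_ratio(loss_prevented, total_possible_loss):
    """Actual score as loss prevented divided by total possible loss."""
    if total_possible_loss == 0:
        return 0.0  # assumption: nothing at stake means no evidence
    return loss_prevented / total_possible_loss
```

For instance, 45 prevented attacks out of 50 would give an actual score of 0.9 under the first definition.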

After the model has been trained, when an optimal threat model needs to be chosen from the threat model library according to the actual service source code and data stream, the trained machine learning model can be used to score each threat model.

Still take e-commerce platform as an example. Suppose we are examining a code segment that involves user data processing, which contains business scenarios such as user login behavior, personal information query, shopping cart management, and order processing. We have extracted features from the source code and data stream such as API calls (such as calling APIs to query for user information), data structures (such as those of user information), etc. We then encode these characteristics into feature vectors.

This feature vector is then input into the machine learning model together with the characteristic parameters of each threat model in the threat model library. For each threat model, the machine learning model outputs a prediction score that reflects how applicable that threat model is to the current business environment.

For example, the model may give the following scores:

SQL injection attack model: score 0.8;

cross-site scripting attack model: score 0.6;

security header misconfiguration model: score 0.4;

sensitive data exposure model: score 0.7;

……

By comparing the prediction scores of all threat models, the threat model with the highest score, here the SQL injection attack model, can be selected as the threat model best suited to the current environment.
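The selection step can be sketched as scoring every threat model in the library against the same sample vector and sorting the results. In the sketch, `overlap_score` is a hypothetical stand-in for the trained machine learning model, and the library vectors are invented for illustration.

```python
def rank_threat_models(sample_vector, library, score_fn):
    """Score every threat model against the sample and sort best first."""
    scored = [(name, score_fn(sample_vector, params))
              for name, params in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def overlap_score(sample, threat):
    # Hypothetical scorer: fraction of feature positions active in both vectors.
    return sum(s * t for s, t in zip(sample, threat)) / len(sample)

LIBRARY = {
    "sql_injection": [1.0, 1.0, 0.0, 1.0],
    "xss": [0.0, 1.0, 0.0, 1.0],
    "ddos": [0.0, 0.0, 1.0, 0.0],
}
ranking = rank_threat_models([1.0, 1.0, 0.0, 1.0], LIBRARY, overlap_score)
best_name, best_score = ranking[0]
```

Passing the scorer as a parameter keeps the ranking logic independent of how the prediction score is produced.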

This threat model can then guide security checks and safeguards to prevent possible SQL injection attacks. This is an example of how the machine learning model of the present invention may be used to pick the best-suited threat model in a particular scenario.

Sometimes a single threat model may not fully cover all possible security threats in a complex business scenario. In this case, it may be necessary to pick a plurality of high scoring threat models from a threat model library, combine them together, and protect this business scenario together.

For example, returning to the e-commerce platform above, although the SQL injection attack model scores the highest, the cross-site scripting attack model and the sensitive data exposure model also score relatively high. This may mean that all three threats are present in this business scenario.

Thus, we may choose to combine these three threat models for security protection. For example, we may adopt parameterized database queries against possible SQL injection attacks, apply input filtering and output encoding against possible cross-site scripting attacks, and apply data encryption and access control against possible sensitive data exposure.

In this way, we can more fully secure business scenarios. It should be noted, however, that model combining adds complexity and possible performance overhead, and thus requires trade-offs for the particular situation.
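Combining several high-scoring threat models might be sketched as follows, using the example scores given earlier. The score threshold of 0.6 and the cap of three models are hypothetical parameters; the method does not prescribe how the combination is bounded.

```python
def select_combination(scores, threshold=0.6, max_models=3):
    """Return up to max_models names scoring at or above threshold, best first."""
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return [name for name, score in ranked if score >= threshold][:max_models]

# Prediction scores from the e-commerce example.
scores = {
    "SQL injection": 0.8,
    "Cross-site scripting": 0.6,
    "Security header misconfiguration": 0.4,
    "Sensitive data exposure": 0.7,
}
combination = select_combination(scores)
```

Raising the threshold or lowering the cap reduces the complexity and overhead of the combined model at the cost of narrower coverage.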

In addition, this example focuses on only one code segment that involves user data processing. This code segment may be only a small fraction of the millions of lines of code on an e-commerce platform. The purpose here is to simplify the discussion and to explain more clearly how the best-suited threat model is chosen from a large number of threat models.

In practical applications, comprehensive analysis is required for all code segments and data streams of the e-commerce platform. This may include various business scenarios such as user registration, password recovery, merchandise browsing, order payment, customer service, etc.

When each business scenario is analyzed, the relevant characteristics are extracted and input into the machine learning model, and the most suitable threat model is selected from the threat model library. Different business scenarios may select different threat models. In this way, a comprehensive and tailored threat model system can be built for the whole e-commerce platform.

This system may contain multiple models, each model being responsible for dealing with one or a class of specific threats. For example, some models are responsible for defending against SQL injection, some models are responsible for defending against cross-site scripting attack, some models are responsible for defending against password cracking, etc.

The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall be covered by the protection scope of the present invention.

Claims

1. A scene threat modeling method based on multi-library fusion is characterized by comprising the following steps:

2. The scene threat modeling method based on multi-library fusion according to claim 1, wherein a neural network is used for model training in the machine learning model training process.

3. The scene threat modeling method based on multi-library fusion according to claim 1, wherein the process of extracting characteristics from the service source code and data stream in S2 and S3 comprises:

4. The scene threat modeling method based on multi-library fusion according to claim 1, wherein the method for calculating the similarity between the sample characteristics and the threat model characteristic parameters is either of the following: a cosine similarity calculation method or a Euclidean distance calculation method.

5. The scene threat modeling method based on multi-library fusion according to claim 1, wherein in S3, the order in which threat models are selected from the threat model library to be input into the machine learning model is based on a specific policy.

6. The scene threat modeling method based on multi-library fusion according to claim 5, wherein the policy includes preferentially selecting threat models newly added to the library.

7. The scene threat modeling method based on multi-library fusion according to claim 1, wherein the machine learning model is trained by incremental learning when a new threat model is added to the threat model library.

8. The scene threat modeling method based on multi-library fusion according to claim 7, wherein the incremental learning process comprises: