CN117708813B

CN117708813B - Security detection method and system for software development environment

Info

Publication number: CN117708813B
Application number: CN202311618733.7A
Authority: CN
Inventors: 黄诚; 李乐融; 曾雨潼; 余泓豪; 徐建斌
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2023-11-30
Filing date: 2023-11-30
Publication date: 2024-06-21
Anticipated expiration: 2043-11-30
Also published as: CN117708813A

Abstract

The invention discloses a security detection method and system for a software development environment. The method comprises the following steps: screening the Visual Studio Code editor plug-ins and browser plug-ins to be detected against a blacklist; statically extracting code behaviors from plug-ins that are not on the blacklist, obtaining the API call sequence in the code by combining abstract syntax trees with regular expression matching; modeling malicious plug-in behaviors through feature engineering and statically detecting plug-in maliciousness with a random forest classifier; and further classifying the plug-ins judged malicious with a hierarchical classifier to obtain a specific malicious category. The invention also provides a system implementing the method, which supports online analysis and local scanning of the browser plug-ins and code editor plug-ins in a development environment and thereby provides comprehensive security detection for both types of plug-in. The method and system can effectively identify and accurately classify malicious plug-in behaviors, giving software developers more comprehensive security protection.

Description

Security detection method and system for software development environment

Technical Field

The invention relates to network security technology, and in particular to a security detection method and system for a software development environment.

Background

As the software development ecosystem keeps expanding, developers rely on more and more tool plug-ins, development environments grow more complex, and their security becomes harder to guarantee. By introducing a malicious third-party plug-in into a software development environment, an attacker can steal sensitive data from that environment, causing data leakage, and can inject malicious code into the program under development, which creates a serious security risk. Detecting the malicious plug-ins present in a developer's current environment therefore helps secure the developer's tooling.

In the prior art, there is no method or system for comprehensively scanning and detecting the code editor plug-ins and browser plug-ins in a development environment. For browser plug-in security detection, Cisco's Duo Security released CRXcavator, an automated Chrome extension security assessment tool. A user can log in to the platform and retrieve the risk score and analysis report of a given extension by searching for the Chrome extension's name or extension ID. It has the following drawbacks: the user can only search by extension name or ID, so plug-ins that have not already been analyzed online cannot be analyzed; the tool is tied to one platform and cannot detect plug-ins from other platforms; and it cannot check the plug-ins in the developer's local environment, so the security of that environment cannot be guaranteed. For malicious JavaScript detection, Aurore Fass et al. proposed JStap, a modular static JavaScript detection system that extends lexical and abstract-syntax-tree-based detection with control-flow and data-flow information. However, it can only classify code as malicious or benign and cannot further classify the specific malicious behavior of the code.

Disclosure of Invention

In view of the above, the object of the present invention is to provide a security detection method and system for a software development environment that performs security analysis on the Visual Studio Code (VSCode) code editor plug-ins and browser plug-ins used by a developer. The API call sequence is extracted accurately by combining an abstract syntax tree (AST) with regular expressions. A hierarchical classifier that goes from API to sensitive behavior to malicious plug-in category enables accurate multi-class detection of malicious plug-ins and identifies their specific category. The method supports online analysis and local scanning of the browser plug-ins and code editor plug-ins in a software development environment, provides comprehensive security detection for both types of plug-in, and gives developers a security guarantee.

To achieve the above object, the present invention adopts the following technical solution:

The first aspect of the present invention provides a security detection method for a software development environment, comprising the following steps:

S100: screening the VSCode plug-ins and Web browser plug-ins to be detected against a blacklist to determine whether they are known malicious plug-ins;

S200: for plug-ins that are not on the malicious plug-in blacklist of S100, statically extracting code behaviors and efficiently obtaining the API call sequence in the code by combining abstract syntax trees with regular expression matching;

S300: based on the API call sequence extracted in S200, modeling malicious plug-in behaviors through feature engineering and statically detecting plug-in maliciousness with a random forest classifier, classifying each plug-in as malicious or benign;

S400: for the malicious plug-ins output by S300, mapping the API call sequence to specific sensitive behaviors through a hierarchical classifier, and classifying each malicious plug-in according to the order in which the behaviors occur to obtain its category.

Preferably, in S100, blacklist-based detection of malicious plug-ins specifically includes:

S110: collecting publicly acknowledged malicious plug-in data from the Internet, merging in the malicious plug-in information detected and verified by this detection flow, building a malicious plug-in blacklist, and periodically synchronizing newly discovered malicious plug-in information to keep the blacklist comprehensive and up to date;

S120: using the blacklist built in S110, looking up the unique identifier (ID) of the plug-in to be detected, and marking the plug-in as malicious if the ID is in the blacklist.

Preferably, in S200, static extraction of plug-in code behavior specifically includes:

S210: first attempting to parse the source code into an abstract syntax tree, and obtaining the API call sequence by traversing the generated tree;

S220: if abstract syntax tree generation fails, falling back to regular expression matching, first identifying all sensitive API calls in the source code and then ordering the API calls with a specific algorithm.

Preferably, obtaining the API call sequence with the abstract syntax tree specifically includes:

S211: parsing the source code into an abstract syntax tree using a syntax parser;

S212: traversing the generated abstract syntax tree level by level to obtain the custom function declaration subtrees and the remaining code subtrees, and enqueuing them in two separate queues;

S213: dequeuing subtrees from the custom function declaration subtree queue in order, and traversing each dequeued subtree in preorder to obtain its sensitive API calls; if the custom function calls another custom function that has already been traversed, inserting the callee's API call sequence into the caller's API call sequence; if the called custom function has not yet been updated, re-enqueuing the subtree; marking each custom function whose API call sequence has been obtained as updated, until the API call sequences of all custom functions are updated;

S214: dequeuing subtrees from the remaining code subtree queue in order and traversing each dequeued subtree in preorder to obtain sensitive API calls; if the current node is a call to a custom function, appending that function's ordered API call sequence to the end of the source file's API call sequence;

S215: repeating S213 and S214 until both the custom function declaration subtree queue and the remaining code subtree queue are empty, then outputting the API call sequence.

Preferably, obtaining the API call sequence with regular expression matching specifically includes:

S221: matching the text range of each function definition block using regular expressions;

S222: identifying all sensitive API calls in the source file and ordering them by their relative text positions using selection sort;

S223: adding the API calls inside each function definition block to the call sequence of the corresponding custom function;

S224: recursively updating the API call sequence of each custom function by inserting the ordered API call sequences of the functions called inside its definition block, until all custom functions are updated;

S225: removing all API calls located inside function definition blocks from the source file's API call sequence;

S226: identifying the calls to custom functions in the source file and inserting each function's API call sequence into the source file's API call sequence at the call position.

Preferably, in S300, static detection of plug-in maliciousness specifically includes:

S310: based on analysis of malicious plug-in samples, defining behavior sequence features used to judge whether a plug-in exhibits suspicious behavior, where each feature is an ordered combination of several behaviors and represents one specific sensitive behavior;

S320: mapping the API call sequence extracted in S200 to a behavior sequence feature vector;

S330: feeding the feature vector into a trained random forest classifier for prediction, where the random forest consists of multiple decision trees, each tree evaluates the feature vector, and the final classification is decided by voting.
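The voting step in S330 can be illustrated with a minimal sketch. The three hand-written rules below merely stand in for trained decision trees, and the layout of the feature vector (three binary behavior sequence features) is an assumption made for illustration, not the classifier described in the patent.

```python
from collections import Counter

# Hypothetical stand-ins for trained decision trees: each one takes a binary
# behavior-sequence feature vector and returns a class label.
TREES = [
    lambda x: "malicious" if x[0] else "benign",
    lambda x: "malicious" if x[1] else "benign",
    lambda x: "malicious" if x[0] and x[2] else "benign",
]


def forest_predict(trees, features):
    """Return the label chosen by the majority of the trees."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]


print(forest_predict(TREES, [1, 0, 1]))
```

With an odd number of trees and two classes, as here, the vote cannot tie; a real forest would be trained rather than written by hand.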

Preferably, the training method of the random forest classifier is as follows:

S331: constructing a data set containing benign samples and malicious samples;

S332: extracting the defined behavior sequence features from the data set by static analysis to form a behavior feature matrix;

S333: splitting the whole data set into a training set and a test set at a ratio of 8:2 for model training and evaluation;

S334: randomly sampling the training set and building a decision tree on the feature set obtained from each sample, splitting each tree on the best feature to improve node purity;

S335: traversing the key parameters of the random forest model by grid search, including the number of classifiers and the depth of the decision trees, to select optimized model parameters;

S336: evaluating the model on the test set and saving the best-performing model for future prediction tasks.
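The 8:2 split of S333 and the parameter enumeration behind the grid search of S335 can be sketched with the standard library alone. The parameter names `n_estimators` and `max_depth`, the candidate values, and the fixed seed are assumptions chosen for illustration; the patent only states that the number of classifiers and the tree depth are searched.

```python
import itertools
import random


def split_80_20(samples, seed=0):
    """Shuffle a copy of the samples and split it 8:2 into train and test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical search space: number of trees and maximum tree depth.
PARAM_GRID = {"n_estimators": [50, 100, 200], "max_depth": [5, 10, None]}


def grid(param_grid):
    """Yield every parameter combination in the grid as a dict."""
    keys = list(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


train, test = split_80_20(range(10))
candidates = list(grid(PARAM_GRID))
```

Each candidate would then be used to train a forest on `train` and be scored on held-out data, keeping the best-scoring parameters.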

The invention also provides a security detection system for a software development environment, comprising:

User interaction and security situation module: provides an interactive interface for searching for or uploading plug-ins to be detected or scanning locally installed plug-ins, and displays the security state of the development environment, detection results, and historical detection records;

Plug-in information acquisition module: automatically collects plug-in source code from plug-in marketplaces or platforms, providing the data basis for security analysis;

Client local development environment security detection module: runs on the user's machine, automatically scans the designated plug-in installation paths, and obtains the VSCode plug-ins and browser plug-ins installed in the local development environment;

Plug-in maliciousness static detection module: extracts the API call sequence by combining abstract syntax trees with regular expression matching, and judges plug-in maliciousness with a random forest model;

Plug-in malicious behavior determination module: for plug-ins preliminarily judged malicious, maps the API call sequence to specific sensitive behaviors through a hierarchical classifier, and then classifies the malicious plug-in according to the order of the behaviors to obtain its category;

User management module: handles user registration, login, and permission verification, and ensures secure access to each user's personalized security state and historical detection data;

Data storage module: stores all data, including plug-in information, detection results, user configuration, and system logs, and ensures data security and integrity.

Compared with the prior art, the invention has the following beneficial effects: by combining static code analysis, machine learning classification, and hierarchical behavior classification, the invention provides a comprehensive and accurate security detection method for development environments. It identifies malicious plug-ins more effectively than the prior art and achieves accurate multi-class detection of malicious behavior types, providing more comprehensive security for a developer's code editor and browser.

Drawings

FIG. 1 is the overall flow chart of the first embodiment of the invention;

FIG. 2 is a flow chart of obtaining the API call sequence using an abstract syntax tree in the first embodiment of the invention;

FIG. 3 is a flow chart of obtaining the API call sequence using regular expression matching in the first embodiment of the invention;

FIG. 4 is a system structure diagram of the second embodiment of the invention;

FIG. 5 is a system flow chart of the second embodiment of the invention.

Detailed Description

Specific embodiments of the invention are described in detail below with reference to the accompanying drawings. The following embodiments and figures illustrate the invention and are not intended to limit its scope.

Embodiment one:

Referring to FIG. 1, a security detection method for a software development environment includes the following steps:

S100: screening the VSCode plug-ins and Web browser plug-ins to be detected against a blacklist to determine whether they are known malicious plug-ins.

In one embodiment, step S100 includes the following steps:

S110: collecting publicly acknowledged malicious plug-in data from the Internet, merging in the malicious plug-in information detected and verified by this detection flow, building a malicious plug-in blacklist, and periodically synchronizing newly discovered malicious plug-in information to keep the blacklist comprehensive and up to date.

The Internet sources of malicious plug-in data include samples disclosed by the well-known security companies Snyk, Kaspersky and Threatmon. Compared with existing schemes, this embodiment not only collects data from multiple Internet sources but also adds, on top of the published data, the malicious plug-ins detected and confirmed by this embodiment's own detection of VSCode and browser plug-ins. This addresses the incomplete coverage, limited perspective, insufficient detail, and stale data of a single data source.

S120: using the blacklist built in S110, looking up the unique identifier (ID) of the plug-in to be detected, and marking the plug-in as malicious if the ID is in the blacklist.
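The blacklist synchronization and lookup of S110 and S120 can be sketched as a set membership test. The `Blacklist` class, the case-insensitive normalization of IDs, and the example ID are all assumptions made for illustration; the patent does not specify how the blacklist is stored.

```python
from dataclasses import dataclass, field


@dataclass
class Blacklist:
    """In-memory set of known-malicious plug-in IDs (hypothetical structure)."""
    ids: set = field(default_factory=set)

    @staticmethod
    def _normalize(plugin_id: str) -> str:
        # Assumption: IDs are compared case-insensitively and without padding.
        return plugin_id.strip().lower()

    def sync(self, newly_confirmed) -> None:
        # S110: merge newly discovered malicious plug-in IDs into the blacklist.
        self.ids.update(self._normalize(pid) for pid in newly_confirmed)

    def is_malicious(self, plugin_id: str) -> bool:
        # S120: a plug-in is flagged when its unique ID is already listed.
        return self._normalize(plugin_id) in self.ids


blacklist = Blacklist()
blacklist.sync(["evil.publisher.theme-stealer"])  # made-up ID for illustration
```

A plug-in that passes this check is not thereby known to be benign; it simply moves on to the static extraction of S200.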

S200: for plug-ins that are not on the malicious plug-in blacklist of S100, statically extracting code behaviors and efficiently obtaining the API call sequence in the code by combining the abstract syntax tree with regular expression matching.

Specifically, this embodiment adopts a strategy of "abstract syntax tree first, regular expression matching as a supplement" to make full use of the abstract syntax tree's strengths in precise syntax analysis and sensitive API call recognition. Because JavaScript syntax is varied and parsing tools have inherent limitations, regular expression matching serves as an effective supplement when conventional syntax analysis fails, compensating for the abstract syntax tree's weakness on certain complex syntax structures. Compared with the prior art, this way of extracting the API call sequence is compatible with a wide range of coding styles and patterns, handles repeated iteration and recursion, and improves the analysis of nested and complex function structures.

In one embodiment, step S200 includes the following steps:

S210: first parsing the source code into an abstract syntax tree, and obtaining the API call sequence by traversing the generated tree.

While traversing the abstract syntax tree, note that the API call nodes inside a function declaration block are not executed at the point of declaration. The subtrees are therefore divided into two classes when the tree is parsed: the first class consists exclusively of custom function declaration blocks, and all remaining subtrees form the second class. The API call sequence obtained by traversing the second-class subtrees represents the API calls of the code as actually executed. Custom function subtrees are traversed in the same order as the second-class subtrees to build each function's ordered API call sequence, which is then inserted wherever the function is called.

Referring to FIG. 2, in one embodiment, obtaining the API call sequence using an abstract syntax tree includes the following steps:

S211: parsing the source code into an abstract syntax tree using a syntax parser.

Specifically, the syntax parser is Esprima, a popular, high-performance ECMAScript parser.

S212: traversing the generated abstract syntax tree level by level to obtain the custom function declaration subtrees and the remaining code subtrees, and enqueuing them in two separate queues.

S213: dequeuing subtrees from the custom function declaration subtree queue in order, and traversing each dequeued subtree in preorder to obtain its sensitive API calls; if the custom function calls another custom function that has already been traversed, inserting the callee's API call sequence into the caller's API call sequence; if the called custom function has not yet been updated, re-enqueuing the subtree; and marking each custom function whose API call sequence has been obtained as updated, until the API call sequences of all custom functions are updated.

S214: dequeuing subtrees from the remaining code subtree queue in order and traversing each dequeued subtree in preorder to obtain sensitive API calls; if the current node is a call to a custom function, appending that function's ordered API call sequence to the end of the source file's API call sequence.
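The two-queue traversal of S212 to S214 can be sketched on a hand-built tree. The node shape produced by the hypothetical `node` helper is a simplification rather than Esprima's actual output, the sensitive API set is illustrative, and the stall counter that breaks recursive call cycles is an added assumption, since the text does not say how recursion terminates.

```python
from collections import deque

SENSITIVE_APIS = {"fs.readFile", "https.request", "child_process.exec"}  # illustrative


def node(kind, *children, name=None):
    """Build a simplified AST node (hypothetical shape, not Esprima's ESTree)."""
    return {"type": kind, "name": name, "children": list(children)}


def preorder_calls(tree):
    """Yield the callee name of every call node, parents before children."""
    if tree["type"] == "FunctionDecl":
        return  # nested declarations do not execute where they are declared
    if tree["type"] == "Call":
        yield tree["name"]
    for child in tree["children"]:
        yield from preorder_calls(child)


def inline(calls, resolved):
    """Keep sensitive APIs and expand calls to already-resolved custom functions."""
    seq = []
    for callee in calls:
        if callee in resolved:
            seq.extend(resolved[callee])
        elif callee in SENSITIVE_APIS:
            seq.append(callee)
    return seq


def extract_sequence(program):
    # S212: enqueue function declarations and remaining code separately.
    decl_queue, rest_queue = deque(), deque()
    for sub in program["children"]:
        (decl_queue if sub["type"] == "FunctionDecl" else rest_queue).append(sub)
    declared = {d["name"] for d in decl_queue}

    # S213: resolve custom functions, re-enqueueing those with pending callees.
    resolved, stalled = {}, 0
    while decl_queue:
        decl = decl_queue.popleft()
        calls = [c for child in decl["children"] for c in preorder_calls(child)]
        pending = any(c in declared and c not in resolved for c in calls)
        if pending and stalled <= len(decl_queue):
            decl_queue.append(decl)
            stalled += 1
            continue
        # A full pass without progress means recursion: drop unresolved calls.
        resolved[decl["name"]] = inline(calls, resolved)
        stalled = 0

    # S214: traverse the remaining code, inlining custom functions at call sites.
    output = []
    while rest_queue:
        output.extend(inline(preorder_calls(rest_queue.popleft()), resolved))
    return output


program = node(
    "Program",
    node("FunctionDecl", node("Call", name="fs.readFile"), node("Call", name="send"), name="steal"),
    node("FunctionDecl", node("Call", name="https.request"), name="send"),
    node("Call", name="steal"),
    node("Call", name="child_process.exec"),
)
print(extract_sequence(program))
```

Here `steal` is declared before `send`, so it is re-enqueued once and resolved only after `send` has been processed, which is the case the re-enqueue rule in S213 is meant to handle.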

Referring to FIG. 3, in one embodiment, obtaining the API call sequence using regular expression matching includes the following steps:

S221: matching the text range of each function definition block using regular expressions.

S222: identifying all sensitive API calls in the source file and ordering them by their relative text positions using selection sort.

S223: adding the API calls inside each function definition block to the call sequence of the corresponding custom function.

S224: recursively updating the API call sequence of each custom function by inserting the ordered API call sequences of the functions called inside its definition block, until all custom functions are updated.

S225: removing all API calls located inside function definition blocks from the source file's API call sequence.
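A rough sketch of the regular expression fallback follows. The patterns, the sensitive API list, and the brace-counting helper are assumptions for illustration; the sketch ignores nested functions, strings, and comments, and it uses Python's built-in sort where the text names selection sort.

```python
import re

# Illustrative sensitive API list; the patent does not enumerate one.
API_RE = re.compile(r"\b(fs\.readFile|https\.request|child_process\.exec)\s*\(")
FUNC_RE = re.compile(r"\bfunction\s+([A-Za-z_$][\w$]*)\s*\([^)]*\)\s*\{")


def function_blocks(src):
    """S221: map each function name to (name_pos, block_start, block_end)."""
    blocks = {}
    for m in FUNC_RE.finditer(src):
        depth, i = 1, m.end()
        while i < len(src) and depth:  # naive brace counting
            depth += {"{": 1, "}": -1}.get(src[i], 0)
            i += 1
        blocks[m.group(1)] = (m.start(1), m.start(), i)
    return blocks


def extract_sequence(src):
    blocks = function_blocks(src)
    # S222: sensitive API calls, later ordered by text position.
    events = [(m.start(), "api", m.group(1)) for m in API_RE.finditer(src)]
    if blocks:
        names = "|".join(map(re.escape, blocks))
        call_re = re.compile(r"(?<![\w$.])(" + names + r")\s*\(")
        def_sites = {name_pos for name_pos, _, _ in blocks.values()}
        events += [(m.start(), "fn", m.group(1)) for m in call_re.finditer(src)
                   if m.start() not in def_sites]  # skip the definitions themselves
    events.sort()

    def owner(pos):
        # Name of the function block containing pos, or None for top-level code.
        return next((n for n, (_, s, e) in blocks.items() if s <= pos < e), None)

    def expand(scope, active=()):
        # S223 to S226: collect one scope's calls, inlining custom functions.
        seq = []
        for pos, kind, val in events:
            if owner(pos) != scope:
                continue
            if kind == "api":
                seq.append(val)
            elif val not in active:  # guard against recursive calls
                seq += expand(val, active + (val,))
        return seq

    return expand(None)


SRC = """
function send(d) { https.request(d); }
function steal() { fs.readFile('/etc/passwd'); send(1); }
steal();
child_process.exec('ls');
"""
print(extract_sequence(SRC))
```

Expanding the top-level scope last mirrors S225 and S226: calls inside definition blocks only contribute to the output where the enclosing function is actually invoked.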

S300: based on the API call sequence extracted in S200, modeling malicious plug-in behaviors through feature engineering and statically detecting plug-in maliciousness with a random forest classifier, classifying each plug-in as malicious or benign.

In a real production environment, the proportion of malicious plug-ins is extremely low. Making a preliminary maliciousness judgment through static detection quickly rules out most normal plug-ins, which reduces resource consumption and detection latency and improves detection efficiency.

In one embodiment, static detection of plug-in maliciousness includes the following steps:

S310: based on analysis of malicious plug-in samples, defining behavior sequence features used to judge whether a plug-in exhibits suspicious behavior, where each feature is an ordered combination of several behaviors and represents one specific sensitive behavior.

For example, one possible set of behavior sequence features includes: sending out sensitive information, querying system environment variables, downloading content and executing it, writing a file and executing it, reading file content and executing it, reading a file and dynamically executing code, modifying file permissions and creating a process, identifying the operating system platform, modifying the data flow of system command execution results, executing system commands, and performing sensitive file operations. Each behavior sequence feature is a combination of several behaviors; for example, the sequence for sending out sensitive information consists of two behaviors: first accessing the sensitive information, then sending it out through a network request.

S320: mapping the API call sequence extracted in S200 to a behavior sequence feature vector.
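One way to realize the mapping in S320 is to translate each API into a behavior and then test, for every defined feature, whether its behaviors occur in order. The API-to-behavior table, the three sample features, and the binary (present or absent) encoding below are assumptions for illustration; the patent does not state how the vector is encoded.

```python
# Hypothetical mapping from sensitive API to behavior label.
API_TO_BEHAVIOR = {
    "fs.readFile": "file_read",
    "fs.writeFile": "file_write",
    "https.request": "network_send",
    "child_process.exec": "command_exec",
}

# Each feature is an ordered combination of behaviors (S310).
FEATURES = [
    ("file_read", "network_send"),   # send out sensitive information
    ("file_write", "command_exec"),  # write a file and execute it
    ("command_exec",),               # execute system commands
]


def occurs_in_order(pattern, behaviors):
    """True if the pattern is an ordered (not necessarily contiguous) subsequence."""
    remaining = iter(behaviors)
    return all(step in remaining for step in pattern)


def to_feature_vector(api_sequence):
    """Map an API call sequence to one 0/1 entry per behavior sequence feature."""
    behaviors = [API_TO_BEHAVIOR[a] for a in api_sequence if a in API_TO_BEHAVIOR]
    return [int(occurs_in_order(f, behaviors)) for f in FEATURES]


print(to_feature_vector(["fs.readFile", "https.request"]))
```

Because order matters, a network request that precedes the file read does not trigger the first feature.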

Preferably, the training method of the random forest classifier is as follows:

S331: constructing a data set containing benign samples and malicious samples.

Specifically, one possible way to construct the data set is to take malicious samples from the malicious plug-in information disclosed by the well-known security companies Snyk, Kaspersky and Threatmon, and to take plug-ins with high download counts and high user ratings from the rankings of the official plug-in marketplaces as benign samples.

S332: extracting the defined behavior sequence features from the data set by static analysis to form a behavior feature matrix.

S333: splitting the whole data set into a training set and a test set at a ratio of 8:2 for model training and evaluation.

S334: randomly sampling the training set and building a decision tree on the feature set obtained from each sample, splitting each tree on the best feature to improve node purity.

S335: traversing the key parameters of the random forest model by grid search, including the number of classifiers and the depth of the decision trees, to select optimized model parameters.

Specifically, in this embodiment the model file is saved in Joblib format.

Specifically, the hierarchical classifier has three layers, from top to bottom: malicious plug-in category, sensitive behavior category, and sensitive API.

For the malicious plug-in category, based on a study of existing malicious plug-in reports and manual analysis of malicious samples, one possible classification of malicious plug-in behavior is: sensitive information theft, sensitive file operation, malicious command execution, code injection, advertisement injection, and browser hijacking.

The behavior of a malicious plug-in is composed of a series of sensitive behaviors, and this embodiment associates each malicious plug-in category with its corresponding sensitive behavior sequence. For example, the sensitive information theft category follows the behavior sequence "access sensitive information first, then send it over the network". One possible set of sensitive behavior categories is: network transmission, network download, file reading, file deletion, file modification, file creation, code execution, system command execution, external program execution, process information acquisition, system information acquisition, and sensitive file operation.

Sensitive behaviors are implemented through API calls, so this embodiment associates each sensitive behavior sequence with a sensitive API call sequence. One feasible way to define the sensitive APIs is to start from the APIs collected in existing research and, in combination with analysis of malicious samples, define the APIs that can be used for malicious purposes.
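The three layers can be sketched as two lookup tables applied in sequence: sensitive API to sensitive behavior, then ordered behavior sequence to malicious category. The table entries and the rule that a category matches when its behaviors occur in order are assumptions for illustration; the category and behavior names are taken from the lists above.

```python
# Layer 3 to layer 2: sensitive API to sensitive behavior (illustrative entries).
API_TO_BEHAVIOR = {
    "fs.readFile": "file reading",
    "fs.unlink": "file deletion",
    "https.request": "network transmission",
    "child_process.exec": "system command execution",
}

# Layer 2 to layer 1: ordered behavior sequence to malicious plug-in category.
CATEGORY_RULES = {
    "sensitive information theft": ["file reading", "network transmission"],
    "sensitive file operation": ["file deletion"],
    "malicious command execution": ["system command execution"],
}


def occurs_in_order(pattern, behaviors):
    """True if the pattern's behaviors appear in the given front-to-back order."""
    remaining = iter(behaviors)
    return all(step in remaining for step in pattern)


def classify(api_sequence):
    """Return every malicious category whose behavior sequence is matched."""
    behaviors = [API_TO_BEHAVIOR[a] for a in api_sequence if a in API_TO_BEHAVIOR]
    return [cat for cat, pattern in CATEGORY_RULES.items()
            if occurs_in_order(pattern, behaviors)]


print(classify(["fs.readFile", "https.request"]))
```

Returning a list allows one plug-in to fall into several categories; whether the patented classifier reports one category or several is not stated in the text.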

Embodiment two:

A security detection system for a software development environment, shown in FIG. 4, comprises the following modules:

User interaction and security situation module: provides an interactive interface for searching for or uploading plug-ins to be detected or scanning locally installed plug-ins, and displays the security state of the development environment, detection results, and historical detection records.

Plug-in information acquisition module: automatically collects plug-in source code from plug-in marketplaces or platforms, providing the data basis for security analysis.

Client local development environment security detection module: runs on the user's machine, automatically scans the designated plug-in installation paths, and obtains the VSCode plug-ins and browser plug-ins installed in the local development environment.

Plug-in maliciousness static detection module: extracts the API call sequence by combining abstract syntax trees with regular expression matching, and judges plug-in maliciousness with a random forest model.

Plug-in malicious behavior determination module: determines the category of a malicious plug-in by mapping the API call sequence to specific sensitive behaviors through a hierarchical classifier and then classifying the plug-in according to the order of the behaviors.

User management module: handles user registration, login, and permission verification, and ensures secure access to each user's personalized security state and historical detection data.
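The local scan performed by the client module can be sketched for VSCode extensions, which are commonly installed as one folder per extension, each with a `package.json` manifest containing `publisher`, `name`, and `version` fields. The function name, the returned record shape, and the choice to skip unreadable manifests are assumptions for illustration; browser plug-in directories are not covered by this sketch.

```python
import json
from pathlib import Path


def scan_vscode_extensions(root):
    """List installed extensions found as <root>/<folder>/package.json."""
    found = []
    for manifest in sorted(Path(root).glob("*/package.json")):
        try:
            meta = json.loads(manifest.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # skip unreadable or malformed manifests
        if isinstance(meta, dict) and "publisher" in meta and "name" in meta:
            found.append({
                "id": f'{meta["publisher"]}.{meta["name"]}',
                "version": meta.get("version"),
                "path": str(manifest.parent),
            })
    return found
```

On a typical desktop installation the root would be the `.vscode/extensions` folder in the user's home directory, for example `scan_vscode_extensions(Path.home() / ".vscode" / "extensions")`; the resulting IDs can then be checked against the database and the blacklist.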

The usage flow of the system is shown in FIG. 5 and comprises the following steps:

Step 1: the user logs in to the home page of the detection website and selects a detection mode. For example, the user may search for a plug-in in the input box, upload a plug-in archive, or scan local plug-ins.

Step 2: the corresponding processing is executed according to the detection mode selected by the user.

Step 2.1: if the user chooses to search for a plug-in in the input box, the database is queried by plug-in name or plug-in URL. If the plug-in is in the database, the stored plug-in information is output directly; if it is not, the plug-in archive is downloaded by a web crawler and passed to the next step.

Step 2.2: if the user chooses to upload a plug-in archive, the archive is passed to the next step.

Step 2.3: if the user chooses to scan local plug-ins, the user downloads the local scanning program and runs the local plug-in scan, and the database is queried for each plug-in found. If the plug-in is in the database, the stored plug-in information is output directly; if it is not, the plug-in archive is passed to the next step.

Step 3: the input plug-in archive is decompressed and classified as malicious or benign by the plug-in maliciousness static detection module, and the plug-ins classified as malicious are passed to the next step.

Step 4: the plug-in malicious behavior determination module classifies each malicious plug-in as sensitive information theft, sensitive file operation, malicious command execution, code injection, advertisement injection, or browser hijacking.

Step 5: the detection result is displayed to the user and stored in the database.

It should be noted that this system implements the detection method described in detail in the first embodiment and integrates it into an automated flow, which improves detection efficiency and user interactivity. In addition, the design of the system allows flexible extension and updating to accommodate new security threats and plug-in features.

Those skilled in the art should understand that the embodiments described in this specification are preferred embodiments and should not be regarded as excluding other embodiments. The actions and processes involved in the embodiments may be modified by those skilled in the art without departing from the spirit and scope of the present invention, and such modifications fall within the scope of the appended claims.

Claims

1. A security detection method for a software development environment, characterized by comprising the following steps:

S100: screening the Visual Studio Code plug-ins and Web browser plug-ins to be detected against a blacklist to determine whether they are known malicious plug-ins;

The step S200 includes the steps of:

S220: if abstract syntax tree generation fails, falling back to regular expression matching, first identifying all sensitive API calls in the source code and then ordering the API calls with a specific algorithm;

the step S210 includes the steps of:

S213: dequeuing subtrees from the custom function declaration subtree queue in order, and traversing each dequeued subtree in preorder to obtain its sensitive API calls; if the custom function calls another custom function that has already been traversed, inserting the callee's API call sequence into the caller's API call sequence; if the called custom function has not yet been updated, re-enqueuing the subtree; marking each custom function whose API call sequence has been obtained as updated, until the API call sequences of all custom functions are updated;

S215: repeating S213 and S214 until both the custom function declaration subtree queue and the remaining code subtree queue are empty, then outputting the API call sequence;

The step S220 includes the steps of:

S226: identifying the calls to custom functions in the source file and inserting each function's API call sequence into the source file's API call sequence at the call position;

2. The method for detecting the security of a software development environment according to claim 1, wherein S100 comprises the steps of:

S120: using the blacklist built in S110, looking up the unique identifier (ID) of the plug-in to be detected, and marking the plug-in as malicious if the ID is in the blacklist.

3. The method for detecting the security of a software development environment according to claim 1, wherein the step S300 comprises the steps of:

S310: based on analysis of malicious plug-in samples, defining behavior sequence features used to judge whether a plug-in exhibits suspicious behavior, where each feature is an ordered combination of several behaviors and represents one specific sensitive behavior;

4. A method according to claim 3, wherein the training method of the random forest classifier comprises the steps of:

S331: constructing a data set containing benign samples and malicious samples;

S333: splitting the whole data set into a training set and a test set at a ratio of 8:2 for model training and evaluation;

5. A security detection system of a software development environment for implementing the security detection method of a software development environment according to any one of claims 1 to 4, comprising:

Client local development environment security detection module: runs on the user's machine, automatically scans the designated plug-in installation paths, and obtains the Visual Studio Code plug-ins and browser plug-ins installed in the local development environment;