CN114595454B

CN114595454B - Malicious JS script detection method based on mixed analysis and feature fusion

Info

Publication number: CN114595454B
Application number: CN202210252529.7A
Authority: CN
Inventors: 孙聪; 乔新博; 陈亮
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2022-03-11
Filing date: 2022-03-11
Publication date: 2024-04-02
Anticipated expiration: 2042-03-11
Also published as: CN114595454A

Abstract

The invention provides a malicious JavaScript script detection method based on mixed analysis and feature fusion, which addresses the low detection accuracy of the prior art. The method comprises the following steps: (1) obtaining a training sample set and a test sample set; (2) constructing a malicious JavaScript script detection network model based on feature fusion; (3) iteratively training the malicious JavaScript script detection network model based on feature fusion; and (4) obtaining the detection result for malicious JavaScript scripts. Because the feature-fusion-based network model itself fuses and classifies the dynamic and static features, the method avoids the prior-art problem in which directly concatenating dynamic and static features and feeding them to a random forest model destroys the order information among the features, and it thereby effectively improves the detection accuracy for malicious JavaScript scripts.

Description

Malicious JS script detection method based on mixed analysis and feature fusion

Technical Field

The invention belongs to the technical field of information security and relates to a malicious JavaScript script detection method, specifically one based on mixed analysis and feature fusion, which can be used to detect malicious JavaScript scripts protected by complex obfuscation.

Background

JavaScript is one of the most popular front-end scripting languages in the world and plays an important role on the Internet. It operates the browser through application programming interfaces (APIs) and is used to optimize interfaces, validate form data, check browser information, respond to browser operations, manage login credentials, and so on.

The characteristics of JavaScript greatly simplify browser front-end development. First, JavaScript is cross-platform: it needs only browser support and does not depend on the operating system. Second, JavaScript is dynamic: its syntax is simple and flexible, which suits complex and changing browser tasks. In addition, JavaScript is an interpreted language that is executed as it is interpreted, without precompilation.

While its cross-platform and dynamic nature gives JavaScript great advantages in browser front-end development, it also makes JavaScript one of the main carriers of attacks. Malicious attacks include drive-by download attacks, cross-site scripting (XSS) attacks, heap-spraying attacks, and clickjacking attacks. These attacks can steal user data, create self-replicating malicious worms, and make the user's browser download malware, posing a serious threat to the information security of Internet users. To carry out a malicious attack, a JavaScript script also needs to call various APIs. Studying how to detect malicious JavaScript scripts accurately and effectively has therefore become an important task.

At present, detection technologies for malicious JavaScript scripts fall into three groups: those based on static analysis, those based on dynamic analysis, and those based on mixed (hybrid) analysis. Static-analysis-based detection analyzes the JavaScript script without running it, using techniques such as source code analysis, lexical analysis, and syntax tree analysis, extracts static features of the script, and uses them to detect malicious scripts. Dynamic-analysis-based detection uses dynamic instrumentation, sandbox execution, and similar means to extract dynamic features of the script, such as the number of file reads and writes, function calls, and tracked variable values, and judges maliciousness from these features. Mixed-analysis-based detection extracts static features by static analysis and dynamic features by dynamic analysis, and judges the maliciousness of the script from both.

With the development of machine learning and deep learning, convolutional neural networks (CNN), bidirectional long short-term memory networks (BiLSTM), and random forest models have also been applied to malicious JavaScript script detection. A CNN is a deep feedforward neural network that includes convolution computations; it takes the spatial distribution of its input into account and can therefore capture order information among input features. A BiLSTM is a recurrent neural network: the input of an LSTM layer includes not only the output of the input layer but also the output of the LSTM layer at the previous time step, so it can also capture order information among input features. The random forest algorithm integrates multiple decision trees through ensemble learning; because features are selected at random while the forest is generated, order information among adjacent features is destroyed.

He Xincheng, Xu Lei, et al., in the paper "Malicious JavaScript Code Detection Based on Hybrid Analysis" (APSEC, 2018, pp. 365-374), disclose a method for detecting malicious JavaScript by mixed analysis. The method first collects the source code of relevant web pages and extracts the JavaScript scripts in the source code as well as those embedded in HTML documents. Second, it constructs an abstract syntax tree for each JavaScript script, analyzes the nodes of each tree, and extracts 13 static features from them, including the number of encoding operations, the number of redirection operations, the number of spaces, and the total number of lines of code. It then instruments each JavaScript script, overriding the basic operations to be monitored at run time, and collects run-time statistics of the script as dynamic features; the 12 extracted dynamic features include the number of object reads and writes and the number of binary operations. Next, the dynamic and static features are rewritten as feature vectors, which serve as the training set and test set for a random forest model. Finally, the random forest model is tested on the test set and its metrics are evaluated.

This method has two defects. First, the extracted dynamic and static features are the frequency of a given identifier or the number of occurrences of a given operation; these are frequency features of the JavaScript script, which ignore the order in which operations and identifiers appear in the script and therefore lose its code structure information. Second, the method fuses features by directly concatenating the dynamic and static features obtained by mixed analysis before they are fed to the random forest model. This fusion approach suits that method, but a random forest model cannot directly analyze order information among features, and direct concatenation also disturbs that order information. For these reasons, the detection accuracy of the method is limited.
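The first defect can be illustrated with a minimal Python sketch. The two API call sequences and the helper names below are hypothetical, chosen only to show that counting occurrences cannot tell two differently ordered sequences apart, whereas contiguous windows (the kind of local context a convolution kernel sees) can.

```python
from collections import Counter

def frequency_features(seq):
    """Order-free view: how often each token occurs."""
    return Counter(seq)

def window_features(seq, size=2):
    """Order-aware view: every contiguous window of `size` tokens."""
    return [tuple(seq[i:i + size]) for i in range(len(seq) - size + 1)]

# Two hypothetical API call sequences: same calls, different order.
a = ["createElement", "eval", "unescape", "write"]
b = ["write", "unescape", "eval", "createElement"]

same_counts = frequency_features(a) == frequency_features(b)
same_windows = window_features(a) == window_features(b)
print(same_counts, same_windows)  # True False
```

A model that only sees the counts treats `a` and `b` as identical samples; a model that consumes the sequence itself does not.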

Disclosure of Invention

The purpose of the invention is to overcome the above defects of the prior art by providing a malicious JS script detection method based on mixed analysis and feature fusion, which addresses the low detection accuracy of the prior art.

In order to achieve the above purpose, the technical scheme adopted by the invention comprises the following steps:

(1) Acquiring a training sample set and a test sample set:

(1a) Obtain $V$ JavaScript scripts $J = \{F^v \mid 1 \le v \le V\}$ and run each JavaScript script $F^v$; while $F^v$ runs, dynamically monitor its calls to application programming interfaces (APIs) to obtain the set of API call sequences $S_1 = \{W_1^v \mid 1 \le v \le V\}$ corresponding to $J$; at the same time, perform static syntax analysis on $F^v$ to obtain an abstract syntax tree $AST^v$, and visit the nodes of $AST^v$ with a depth-first traversal algorithm to obtain the set of syntax unit sequences $S_2 = \{W_2^v \mid 1 \le v \le V\}$ corresponding to $J$, where $V > 10000$, $F^v$ denotes the $v$-th JavaScript script, and $W_1^v$ and $W_2^v$ denote the API call sequence and the syntax unit sequence corresponding to $F^v$, respectively;

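(1b) Use a Word2Vec model to convert each API call sequence $W_1^v$ and each syntax unit sequence $W_2^v$ into word vectors, obtaining the API call sequence word vectors $T_1 = \{X_1^v \mid 1 \le v \le V\}$ of $S_1$ and the syntax unit sequence word vectors $T_2 = \{X_2^v \mid 1 \le v \le V\}$ of $S_2$;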

(1c) Label each API call sequence word vector $X_1^v$ and each syntax unit sequence word vector $X_2^v$ with the same label; form the training sample set $Q_1$ from more than half of the API call sequence word vectors in $T_1$ and more than half of the syntax unit sequence word vectors in $T_2$, together with the label shared by each pair of vectors; then form the test sample set $Q_2$ from the remaining API call sequence word vectors in $T_1$ and the remaining syntax unit sequence word vectors in $T_2$, together with the label shared by each pair of vectors;
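The depth-first traversal in step (1a) can be sketched as follows. This is only an illustration under stated assumptions: the AST is a hand-written, ESTree-style nested dictionary of the kind a JavaScript parser emits as JSON, each "syntax unit" is taken to be a node's `type` field (the text does not say exactly which node attribute is recorded), and `syntax_unit_sequence` is a hypothetical helper name.

```python
def syntax_unit_sequence(node):
    """Return node types in depth-first (pre-order) visiting order."""
    sequence = []

    def visit(value):
        if isinstance(value, dict):
            if "type" in value:
                sequence.append(value["type"])   # record this node first
            for child in value.values():         # then descend into its fields
                visit(child)
        elif isinstance(value, list):
            for item in value:
                visit(item)
        # strings, numbers, None: leaves with nothing to record

    visit(node)
    return sequence

# Hand-written ESTree-style tree for the statement: eval(x);
ast = {
    "type": "Program",
    "body": [{
        "type": "ExpressionStatement",
        "expression": {
            "type": "CallExpression",
            "callee": {"type": "Identifier", "name": "eval"},
            "arguments": [{"type": "Identifier", "name": "x"}],
        },
    }],
}

units = syntax_unit_sequence(ast)
print(units)
```

Unlike a table of node counts, the resulting list keeps the order in which constructs appear, which is what the later sequence models consume.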

(2) Constructing a malicious JavaScript script detection network model based on feature fusion:

Construct a malicious JavaScript script detection network model comprising a convolutional neural network (CNN) and a bidirectional long short-term memory network (BiLSTM) arranged in parallel, whose outputs are connected in sequence to a feature fusion module and a feature classifier E. The CNN comprises a stacked convolution layer and max-pooling layer, and the activation function of the convolution kernels in the convolution layer is ReLU; the BiLSTM comprises a stacked forward LSTM layer and backward LSTM layer, and the activation function of the LSTM units in the LSTM layers is sigmoid; E comprises a fully connected layer and an output layer with a sigmoid activation function;

(3) Iteratively training the malicious JavaScript script detection network model based on feature fusion:

(3a) Let the iteration counter be $i$ and the maximum number of iterations be $I$, with $I \ge 200$; let the feature-fusion-based malicious JavaScript script detection network model have a weight matrix and a bias matrix at iteration $i$; and set $i = 0$;

(3b) Forward-propagate the training sample set $Q_1$ as the input of the feature-fusion-based malicious JavaScript script detection network model: the CNN extracts the high-dimensional features of the syntax unit sequence in each training sample, while the BiLSTM extracts the high-dimensional features of the API call sequence in each training sample; the feature fusion module fuses each high-dimensional feature extracted by the CNN with the corresponding high-dimensional feature extracted by the BiLSTM; and the feature classifier E maps each fused high-dimensional feature to a vector and passes it through the sigmoid function to obtain the prediction probability of each training sample;

(3c) Use the cross-entropy loss function to calculate the loss value $L^i$ of the feature-fusion-based malicious JavaScript script detection network model in the current iteration; by back-propagation, use $L^i$ to calculate the weight matrix gradient and the bias matrix gradient of the model; and update the weight matrix and the bias matrix with the gradient descent method using these gradients;

(3d) Determine whether $i > I$ holds; if so, the trained feature-fusion-based malicious JavaScript script detection model is obtained; otherwise, let $i = i + 1$ and return to step (3b);
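The control flow of steps (3a) to (3d) can be sketched in plain Python. As a loud simplification, a single sigmoid unit trained on four hypothetical, already-fused feature vectors stands in for the whole CNN/BiLSTM network; the names `train`, `lr_w`, and `lr_b` and the toy data are assumptions made for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train(features, labels, max_iter=200, lr_w=0.5, lr_b=0.5):
    n = len(labels)
    w, b = [0.0] * len(features[0]), 0.0   # (3a) initial weights and bias
    losses = []
    i = 0                                  # (3a) iteration counter
    while True:
        # (3b) forward propagation: one prediction probability per sample
        probs = [predict(w, b, x) for x in features]
        # (3c) binary cross-entropy loss for this iteration
        eps = 1e-12
        loss = -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                    for y, p in zip(labels, probs)) / n
        losses.append(loss)
        # (3c) gradients of the loss, then one gradient-descent update
        grad_w = [sum((p - y) * x[j] for p, y, x in zip(probs, labels, features)) / n
                  for j in range(len(w))]
        grad_b = sum(p - y for p, y in zip(probs, labels)) / n
        w = [wj - lr_w * gj for wj, gj in zip(w, grad_w)]
        b = b - lr_b * grad_b
        # (3d) stop once the counter exceeds the maximum, else iterate again
        if i > max_iter:
            break
        i += 1
    return w, b, losses

# Hypothetical fused features: label 1 = malicious, label 0 = normal.
features = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
labels = [1, 1, 0, 0]
w, b, losses = train(features, labels)
print(losses[0], losses[-1])
```

In the real model the gradients come from back-propagation through every layer; the closed-form gradient used here applies only to this one-unit stand-in.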

(4) Obtaining a detection result of a malicious JavaScript script:

Forward-propagate the test sample set $Q_2$ as the input of the trained feature-fusion-based malicious JavaScript script detection model to obtain the prediction probability of each test sample; if the prediction probability of the $k$-th test sample in $Q_2$ satisfies the decision condition, the JavaScript script corresponding to that test sample is malicious; otherwise, the script is normal.

Compared with the prior art, the invention has the following advantages:

1. The malicious JavaScript script detection network model constructed by the invention comprises a CNN and a BiLSTM arranged in parallel and a feature fusion module connected to their outputs. During model training and detection, the convolution layer of the CNN captures the spatial order information of syntax units over a certain length through the convolution of its kernels, and the forward and backward LSTM layers of the BiLSTM capture the order information of API calls over a certain length. Because the CNN and the BiLSTM each process a different sequence, the invention avoids the prior-art defect in which directly concatenating dynamic and static features and feeding them to a random forest model destroys the order information among the features, and thus effectively improves the detection accuracy for malicious JavaScript scripts.

2. The invention converts the API call sequence obtained by dynamically monitoring the API-calling behavior of each running JavaScript script, together with the syntax unit sequence obtained by static syntax analysis of each script, into a training sample set containing both dynamic and static information. The API call sequences and syntax unit sequences in the training sample set carry order information, which avoids the prior-art defect of extracting only frequency features of the JavaScript script and losing the code order information in it, and further improves the detection accuracy for malicious JavaScript scripts.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention.

Detailed Description

The invention is described in further detail below with reference to the attached drawings and to specific embodiments:

Referring to FIG. 1, the present invention includes the following steps:

Step 1) Obtaining a training sample set and a test sample set:

(1a) Use a web crawler to obtain $V$ JavaScript scripts $J = \{F^v \mid 1 \le v \le V\}$ from the Internet; modify a sandbox that implements the JavaScript APIs by adding a monitoring function to the code of each API in the sandbox, and run each JavaScript script $F^v$ in the modified sandbox; whenever $F^v$ calls an API, the monitoring function in that API's code is also called and records the name of the called API, which yields the set of API call sequences $S_1 = \{W_1^v \mid 1 \le v \le V\}$ corresponding to $J$; at the same time, use the Esprima tool to perform static syntax analysis on $F^v$ to obtain an abstract syntax tree $AST^v$, and visit the nodes of $AST^v$ with a depth-first traversal algorithm to obtain the set of syntax unit sequences $S_2 = \{W_2^v \mid 1 \le v \le V\}$ corresponding to $J$, where $V = 36000$, $F^v$ denotes the $v$-th JavaScript script, and $W_1^v$ and $W_2^v$ denote the API call sequence and the syntax unit sequence corresponding to $F^v$, respectively;

(1b) Word2Vec is a model for word vector conversion; it comprises an input layer, a hidden layer, and an output layer connected in sequence, and the activation function of the output layer is the softmax function. Use a Word2Vec model to convert each API call sequence $W_1^v$ and each syntax unit sequence $W_2^v$ into word vectors, obtaining the API call sequence word vectors $T_1 = \{X_1^v \mid 1 \le v \le V\}$ of $S_1$ and the syntax unit sequence word vectors $T_2 = \{X_2^v \mid 1 \le v \le V\}$ of $S_2$;

(1c) Label each API call sequence word vector $X_1^v$ and each syntax unit sequence word vector $X_2^v$ with the same label, because $X_1^v$ and $X_2^v$ both come from the same JavaScript script $F^v$; form the training sample set $Q_1$ from 80% of the API call sequence word vectors in $T_1$, the corresponding syntax unit sequence word vectors in $T_2$, and the label shared by each pair of vectors; then form the test sample set $Q_2$ from the remaining 20% of the API call sequence word vectors in $T_1$, the remaining syntax unit sequence word vectors in $T_2$, and the label shared by each pair of vectors;
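The monitoring idea in step (1a), where each API implementation reports its own name when called, can be sketched with a Python decorator. This is an analogy only: the real sandbox hooks JavaScript APIs, and `monitored`, `unescape`, and `document_write` below are hypothetical stand-ins written for this illustration.

```python
import functools

def monitored(name, call_log):
    """Wrap an API so that every call appends `name` to `call_log`."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            call_log.append(name)          # the monitoring function records the API name
            return func(*args, **kwargs)   # then the original API body runs
        return wrapper
    return decorate

call_log = []

# Hypothetical stand-ins for sandbox implementations of two browser APIs.
@monitored("unescape", call_log)
def unescape(text):
    return text.replace("%20", " ")

@monitored("document.write", call_log)
def document_write(markup):
    return len(markup)

# A "script" that calls the APIs; the log becomes its API call sequence.
document_write(unescape("hello%20world"))
document_write("<p>done</p>")
print(call_log)  # ['unescape', 'document.write', 'document.write']
```

Because the log is appended to at call time, it reflects execution order, including calls that obfuscated code only reaches at run time.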

Step 2) Constructing a malicious JavaScript script detection network model based on feature fusion:

Construct a malicious JavaScript script detection network model comprising a convolutional neural network (CNN) and a bidirectional long short-term memory network (BiLSTM) arranged in parallel, whose outputs are connected in sequence to a feature fusion module and a feature classifier E. The CNN comprises a stacked convolution layer and max-pooling layer; the convolution layer has 4 convolution kernels, of sizes 7, 15, 35, and 55 respectively, and the activation function of each convolution kernel is ReLU. The BiLSTM comprises a stacked forward LSTM layer and backward LSTM layer; the forward and backward LSTM layers each have 128 LSTM units, and the activation function of the LSTM units is sigmoid. E comprises a fully connected layer and an output layer with a sigmoid activation function;
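The data flow through the CNN branch, the fusion module, and the classifier E can be sketched in plain Python. Several things here are assumptions for illustration: one filter per kernel size, a word-vector dimension of 4, a sequence length of 60, random untrained weights, and an 8-value placeholder in place of the BiLSTM output (no LSTM is implemented). Fusion is shown as concatenation.

```python
import math
import random

def relu(x):
    return max(0.0, x)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def conv_max_pool(sequence, kernel):
    """Slide one kernel along the sequence, apply ReLU, then keep the maximum."""
    size = len(kernel)
    activations = []
    for start in range(len(sequence) - size + 1):
        window = sequence[start:start + size]
        total = sum(w * x
                    for kernel_row, vector in zip(kernel, window)
                    for w, x in zip(kernel_row, vector))
        activations.append(relu(total))
    return max(activations)

def cnn_branch(sequence, kernels):
    """One pooled feature per kernel; kernel sizes may differ."""
    return [conv_max_pool(sequence, kernel) for kernel in kernels]

def fuse(cnn_features, bilstm_features):
    """Feature fusion by concatenating the two branch outputs."""
    return list(cnn_features) + list(bilstm_features)

def classify(fused, weights, bias):
    """Fully connected layer followed by a sigmoid output."""
    return sigmoid(sum(w * f for w, f in zip(weights, fused)) + bias)

rng = random.Random(0)
dim = 4        # hypothetical word-vector dimension
length = 60    # hypothetical sequence length, at least the largest kernel size

syntax_vectors = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(length)]
kernels = [[[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(size)]
           for size in (7, 15, 35, 55)]

cnn_features = cnn_branch(syntax_vectors, kernels)
bilstm_features = [rng.uniform(-1, 1) for _ in range(8)]  # placeholder, not a real BiLSTM
fused = fuse(cnn_features, bilstm_features)
weights = [rng.uniform(-1, 1) for _ in fused]
probability = classify(fused, weights, bias=0.0)
print(len(cnn_features), len(fused), probability)
```

Each branch sees only its own sequence, and the two feature vectors meet only after the order-sensitive layers have done their work.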

Step 3) Iteratively training the malicious JavaScript script detection network model based on feature fusion:

(3b) Forward-propagate the training sample set $Q_1$ as the input of the feature-fusion-based malicious JavaScript script detection network model. In the CNN, the convolution kernels of different sizes in the convolution layer convolve along the syntax unit sequence, the max-pooling layer reduces the dimensionality of the convolution layer's output, and the output of the pooling layer is used as the high-dimensional feature of the syntax unit sequence in each training sample. Meanwhile, the forward LSTM layer of the BiLSTM processes the API call sequence, the output of the forward LSTM layer is used as the input of the backward LSTM layer, and the output of the backward LSTM layer is used as the high-dimensional feature of the API call sequence in each training sample. The feature fusion module fuses each high-dimensional feature extracted by the CNN with the corresponding high-dimensional feature extracted by the BiLSTM by concatenating the two, and the feature classifier E maps each fused high-dimensional feature to a vector and passes it through the sigmoid function to obtain the prediction probability of each training sample;

(3c) Use the cross-entropy loss function to calculate the loss value $L^i$ of the feature-fusion-based malicious JavaScript script detection network model in the current iteration; by back-propagation, use $L^i$ to calculate the weight matrix gradient and the bias matrix gradient of the model; and update the weight matrix and the bias matrix with the gradient descent method using these gradients, each according to its own update formula;
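Step (3c) names a cross-entropy loss and gradient-descent updates. As a sketch only, the standard binary form consistent with a sigmoid output and 0/1 labels would read as follows. The symbols $y_m$ (true label), $\hat{y}_m$ (prediction probability), $M$ (number of training samples), $\omega^i$ (weight matrix), and $b^i$ (bias matrix) are notation assumed here; $\alpha_1$ and $\alpha_2$ are the learning rates named in claim 4.

```latex
\begin{aligned}
L^{i} &= -\frac{1}{M}\sum_{m=1}^{M}\Bigl[\,y_m \log \hat{y}_m + (1-y_m)\log\bigl(1-\hat{y}_m\bigr)\Bigr] \\
\nabla\omega^{i} &= \frac{\partial L^{i}}{\partial \omega^{i}}, \qquad
\nabla b^{i} = \frac{\partial L^{i}}{\partial b^{i}} \\
\omega^{i+1} &= \omega^{i} - \alpha_1 \nabla\omega^{i} \\
b^{i+1} &= b^{i} - \alpha_2 \nabla b^{i}
\end{aligned}
```

With this form, a confident wrong prediction (for example $\hat{y}_m$ near 0 when $y_m = 1$) contributes a large term to $L^i$, so the corresponding gradient step is large.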

Step 4) Obtaining the detection result of the malicious JavaScript script:

To verify the technical effect of the malicious JS script detection method based on mixed analysis and feature fusion, the detection accuracy of the method was evaluated in a simulation experiment. 18000 normal JavaScript scripts were extracted from the top 20 websites by click rate, and 18000 malicious JavaScript samples were obtained from a malware analysis platform. The 18000 normal and 18000 malicious scripts were randomly mixed to form a set of 36000 JavaScript scripts. The mixed analysis method was used to obtain 36000 syntax unit sequences and 36000 API call sequences, which were converted into word vector representations with the Word2Vec model, labeled, and divided into a training set and a test set at a ratio of 8:2.
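The paired 8:2 split can be sketched as below. The helper `split_pairs`, the placeholder vector names, and the fixed shuffle seed are assumptions for illustration; the point is that the two vectors of one script and their shared label always move together into the same set.

```python
import random

def split_pairs(api_vectors, syntax_vectors, labels, train_ratio=0.8, seed=0):
    """Keep each script's two vectors and its shared label together, then split."""
    samples = list(zip(api_vectors, syntax_vectors, labels))
    random.Random(seed).shuffle(samples)
    cut = round(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]

# Ten hypothetical scripts; strings stand in for the real word vectors.
api_vectors = [f"X1_{v}" for v in range(10)]
syntax_vectors = [f"X2_{v}" for v in range(10)]
labels = [v % 2 for v in range(10)]        # 1 = malicious, 0 = normal

train_set, test_set = split_pairs(api_vectors, syntax_vectors, labels)
print(len(train_set), len(test_set))  # 8 2
```

Applied to the 36000 scripts of the experiment, the same ratio gives 28800 training samples and 7200 test samples.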

The feature-fusion-based malicious JavaScript script detection network model was built with Python 3.6 and the Keras deep learning framework on a host with 8 GB of memory running Ubuntu 18.04, and was trained on the training set to obtain a trained detection model. Tested with the test set as input, the detection accuracy of the invention is 99.73%. The average accuracy of the prior art is 97.8%, so the invention is about 1.9 percentage points higher, which shows that the malicious JS script detection method based on mixed analysis and feature fusion provided by the invention outperforms the prior art.

Claims

1. A malicious JS script detection method based on mixed analysis and feature fusion is characterized by comprising the following steps:

(1) Acquiring a training sample set and a test sample set:

(1a) Obtain $V$ JavaScript scripts $J = \{F^v \mid 1 \le v \le V\}$ and run each JavaScript script $F^v$; while $F^v$ runs, dynamically monitor its calls to application programming interfaces (APIs) to obtain the set of API call sequences $S_1 = \{W_1^v \mid 1 \le v \le V\}$ corresponding to $J$; at the same time, perform static syntax analysis on $F^v$ to obtain an abstract syntax tree $AST^v$, and visit the nodes of $AST^v$ with a depth-first traversal algorithm to obtain the set of syntax unit sequences $S_2 = \{W_2^v \mid 1 \le v \le V\}$ corresponding to $J$, where $V > 10000$, $F^v$ denotes the $v$-th JavaScript script, and $W_1^v$ and $W_2^v$ denote the API call sequence and the syntax unit sequence corresponding to $F^v$, respectively;

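(1b) Use a Word2Vec model to convert each API call sequence $W_1^v$ and each syntax unit sequence $W_2^v$ into word vectors, obtaining the API call sequence word vectors $T_1 = \{X_1^v \mid 1 \le v \le V\}$ of $S_1$ and the syntax unit sequence word vectors $T_2 = \{X_2^v \mid 1 \le v \le V\}$ of $S_2$;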

(1c) Label each API call sequence word vector $X_1^v$ and each syntax unit sequence word vector $X_2^v$ with the same label; form the training sample set $Q_1$ from more than half of the API call sequence word vectors in $T_1$ and more than half of the syntax unit sequence word vectors in $T_2$, together with the label shared by each pair of vectors; then form the test sample set $Q_2$ from the remaining API call sequence word vectors in $T_1$ and the remaining syntax unit sequence word vectors in $T_2$, together with the label shared by each pair of vectors;

(3c) Use the cross-entropy loss function to calculate the loss value $L^i$ of the feature-fusion-based malicious JavaScript script detection network model in the current iteration; by back-propagation, use $L^i$ to calculate the weight matrix gradient and the bias matrix gradient of the model; and update the weight matrix and the bias matrix with the gradient descent method using these gradients;

(4) Obtaining a detection result of a malicious JavaScript script:

2. The malicious JS script detection method based on mixed analysis and feature fusion according to claim 1, wherein the Word2Vec model in step (1b) comprises an input layer, a hidden layer, and an output layer connected in sequence, and the activation function of the output layer is the softmax function.

3. The malicious JS script detection method based on mixed analysis and feature fusion according to claim 1, wherein, in the malicious JavaScript script detection network model of step (2), the convolution layer of the convolutional neural network CNN contains 4 convolution kernels of sizes 7, 15, 35, and 55 respectively, and the forward LSTM layer and the backward LSTM layer of the bidirectional long short-term memory network BiLSTM each contain 128 LSTM units.

4. The malicious JS script detection method based on mixed analysis and feature fusion according to claim 1, wherein the loss value $L^i$, the weight matrix gradient, the bias matrix gradient, and the update formulas of the weight matrix and the bias matrix of the malicious JavaScript script detection network model in step (3c) are defined in terms of the following quantities:

the true label of the $m$-th sample and the prediction probability of the $m$-th training sample, where a true label of 1 indicates that the sample is malicious and a true label of 0 indicates that the sample is normal; the updated weight matrix and the weight matrix before the update; the updated bias matrix and the bias matrix before the update; the learning rate $\alpha_1$ of the weight matrix; and the learning rate $\alpha_2$ of the bias matrix.