CN116432184A - Malicious software detection method based on semantic analysis and bidirectional coding characterization

Malicious software detection method based on semantic analysis and bidirectional coding characterization

Info

Publication number
CN116432184A
Authority
CN
China
Prior art keywords
sequence
matrix
function
attention
api
Prior art date
Legal status
Pending
Application number
CN202310588930.2A
Other languages
Chinese (zh)
Inventor
赵运弢
冯永新
刘峻名
Current Assignee
Shenyang Ligong University
Original Assignee
Shenyang Ligong University
Priority date
Filing date
Publication date
Application filed by Shenyang Ligong University filed Critical Shenyang Ligong University
Priority to CN202310588930.2A priority Critical patent/CN116432184A/en
Publication of CN116432184A publication Critical patent/CN116432184A/en
Pending legal-status Critical Current

Classifications

    • G06F21/563 Static detection by source code analysis (computer malware detection or handling)
    • G06F16/35 Clustering; Classification (information retrieval of unstructured textual data)
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F40/30 Semantic analysis (handling natural language data)
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/045 Combinations of networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

Aiming at the problems that traditional models suffer from polysemous word representations and lack contextual semantics when detecting malicious code, the invention provides a malware detection method based on semantic analysis and bidirectional coding characterization. The method combines BERT with a convolutional-recurrent network built on an external attention mechanism, uses malware API function call sequences as the features the model learns, and statically analyzes those sequences to detect malware. Because API call sequences are correlated in context and semantics, BERT is used for the word representation task and receives semantic information from the sequences. A convolutional neural network and a long short-term memory network respectively perform secondary feature extraction and mine the chained relations between API functions. An attention mechanism added after the long short-term memory network better focuses on key information in the text, reduces the influence of noise, and improves accuracy in the text classification task. The method is unaffected by mutation and deformation of malicious code, and its accuracy reaches 98.81%.

Description

Malicious software detection method based on semantic analysis and bidirectional coding characterization
Technical Field
The invention belongs to the field of malware detection within computer security technology, and in particular relates to a malware detection method based on semantic analysis and bidirectional coding characterization.
Background
With the rapid development and popularization of information technology, computers have become an integral part of modern society. However, as computer application scenarios grow more complex, security problems become increasingly prominent: the variety and quantity of malware are expanding rapidly, and its propagation methods are constantly changing. Many problems, including intrusion detection, virus classification, spam analysis, and phishing prevention, have made network security a pressing concern.
In recent years, advanced viruses and advanced persistent threat attacks against industrial control systems have become more frequent; detecting such viruses, which spawn large numbers of variants, has become ever more laborious for methods based on fixed features, and the information security problems of industrial control systems have grown more prominent. As network attacks become more complex, a variety of new malware, including Trojan horses, botnets, adware, and spyware, becomes more damaging and challenging. Virus species are also produced and updated rapidly, posing a greater threat to the Internet. The Atlas VPN team estimated 1.9 million Linux malware samples in 2022, a 50% increase over the previous year. In the third quarter of that year there were 75,841 malware samples targeting Linux, a year-on-year increase of 91%; in the fourth quarter there were 164,697 samples, a year-on-year increase of 117%. Unfortunately, classical security technologies such as antivirus software cannot cope with the rapidly growing diversity of malware, leaving people in doubt about the effectiveness and trustworthiness of the methods currently in use.
In today's globalized world, anyone's computer can become a victim. Moreover, the development of the Internet of Things has enabled everything to be connected and to exchange information over a network, but this also allows malware to spread widely across the many platforms of interconnected devices, and the Internet of Things ecosystem is extremely vulnerable to the large volume of malware attacks seen on traditional computers and smartphones. In addition, the rapid adoption of the Android platform on mobile devices has made detecting malware attacks a challenging task. To fundamentally resolve the crisis caused by malware, the sustainability and safety of Internet of Things development can only be ensured by continuously searching for new solutions and strengthening security measures. It is therefore necessary to address the shortcomings of conventional malware analysis methods and to develop a more effective solution: an intelligent analysis method that is efficient, practical, and able to cope with malware changes.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a malware detection method based on semantic analysis and bidirectional coding characterization. Starting from the detection efficiency of malicious code, it combines semantic analysis with bidirectional coding characterization, improving the accuracy of detecting deformed malicious code and ensuring the robustness of the model when it runs in different environments, platforms, and operating systems. At the same time, the process of manually labeling data is eliminated; by using detection based on the semantic relations and context information of the data, the intrusion behavior of malicious code can be accurately identified, ensuring the security and stability of the computer system.
The malicious software detection method based on semantic analysis and bidirectional coding characterization comprises the following steps:
step 1: acquiring a malicious software data set, storing the malicious software data set in a CSV file form, and extracting an API function call sequence in the data set;
first, a malware dataset is downloaded; the dataset contains basic information for multiple malware samples, each record comprising the following features: sha256 hash value, label, header information, import function library, export function library, section information, string information, sliding-window entropy, linker version, submission size, system version, and subsystem version; the import function library contains the malware API functions;
after the dataset is obtained, for the basic information of each malware sample, the API functions are extracted from the import function library using a Python third-party library while preserving their order in the library, yielding a sequence composed of API functions; that is, each malware sample corresponds to one API function call sequence; finally, two fields, the malware family name and the corresponding API function call sequence, are saved to a CSV file;
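As an illustrative sketch of the extraction step above (the record layout and the field names `family` and `apis` are hypothetical, and the API names are examples; a real pipeline would read them from each sample's import function library with a PE-parsing library):

```python
import csv
import io

def write_api_sequences(records, fh):
    """Write one (family, api_sequence) CSV row per malware sample.

    Each record is assumed to carry the family name and the API functions
    in the order they appear in the sample's import function library.
    """
    writer = csv.writer(fh)
    writer.writerow(["family", "api_sequence"])
    for rec in records:
        # Preserve the original ordering of the API calls.
        writer.writerow([rec["family"], " ".join(rec["apis"])])

records = [{"family": "Trojan.Agent",
            "apis": ["CreateFileA", "WriteFile", "CloseHandle"]}]
buf = io.StringIO()
write_api_sequences(records, buf)
```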
step 2: performing word vectorization on the API function call sequence obtained in the step 1 by adopting a BERT model, so as to generate the feature of the word embedding type;
The BERT model consists of multiple Transformer layers; based on bidirectional coding characterization, the BERT model can treat the API function call sequence of each malicious sample as a text sentence with contextual semantics;
step 2.1: first, Unicode normalization is applied to the malware API call sequences in the CSV file; the API call sequences are then tokenized, splitting each sequence into single characters or character combinations, after which a word segmentation algorithm segments the tokenized text;
step 2.2: constructing the input sequence of the BERT model; special tags are added to the malware API call sequence, including [CLS] and [SEP], where the [CLS] tag marks the beginning of the sequence and the [SEP] tag separates different sentences or paragraphs; the tagged sequence is then converted into 768-dimensional embedding vectors, and a position encoding is added to each embedding vector to represent the position of each API function in the call sequence; a special [MASK] tag is then used to randomly replace the vector values corresponding to particular API functions in the position-encoded embedding vectors; finally, all the embedding vectors are grouped into batches and fed into the Transformer model for further processing;
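A toy sketch of this input construction (the 8-dimensional sinusoidal position encoding stands in for the 768 dimensions of the real model, and the API names are examples; the real model would use learned embeddings via a BERT tokenizer):

```python
import math

def build_bert_input(api_calls, dim=8, mask_index=None):
    """Build a toy BERT-style input: special tags plus position encodings.

    mask_index optionally replaces one API token with [MASK], mirroring
    the random masking described in step 2.2.
    """
    tokens = ["[CLS]"] + list(api_calls) + ["[SEP]"]
    if mask_index is not None:
        tokens[1 + mask_index] = "[MASK]"  # mask one API function
    # One dim-length sinusoidal position vector per token position.
    pos_enc = [
        [math.sin(p / 10000 ** (i / dim)) if i % 2 == 0
         else math.cos(p / 10000 ** ((i - 1) / dim))
         for i in range(dim)]
        for p in range(len(tokens))
    ]
    return tokens, pos_enc

tokens, pe = build_bert_input(
    ["LoadLibraryA", "GetProcAddress", "VirtualAlloc"], mask_index=1)
```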
Step 2.3: adding a multi-head attention mechanism to all the embedded vectors obtained in the step 2.2;
step 2.3.1: generating a transformation matrix based on the embedded vector;
BERT divides the tagged embedding vectors into h parts, i.e., h attention heads; each attention head computes an attention weight matrix used to characterize the correlation between different words in the input; specifically, each attention head generates a query matrix Q, a key matrix K, and a value matrix V through three linear transformations (for the query, key, and value vectors), specifically:
Q = E·W_Q, K = E·W_K, V = E·W_V
where W_Q, W_K, and W_V are the linear transformation matrices for the respective vectors, and E is the vector matrix formed by concatenating all input vectors;
step 2.3.2: computing one part of the embedding vectors, i.e., the single attention mechanism corresponding to one attention head;
the vector matrix E is used in the computation together with the query matrix Q, key matrix K, and value matrix V, yielding the single self-attention output, specifically:
Attention(Q, K, V) = Softmax(Q·K^T / √d_k)·V
where Q·K^T is the similarity score between the query matrix Q and all key vectors; the scaling factor √d_k prevents the Q·K^T products from becoming too large; and the Softmax function normalizes each row vector after the operation so as to compute the importance of each word to the other words;
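A minimal NumPy sketch of this single-head scaled dot-product attention (toy sizes; the projection matrices are random stand-ins for learned weights):

```python
import numpy as np

def scaled_dot_attention(E, W_q, W_k, W_v):
    """Single-head self-attention: Softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = E @ W_q, E @ W_k, E @ W_v      # the three linear projections
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # scaled similarity scores
    # Row-wise softmax turns each row of scores into attention weights.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 8))                  # 4 API tokens, 8-dim embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = scaled_dot_attention(E, W_q, W_k, W_v)
```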
Step 2.3.3: calculating a multi-head attention mechanism of all embedded vectors;
after calculating the output of a single self-attention mechanism, the output of a multi-head attention mechanism is obtained as follows:
Attention_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
that is, a weighted average Attention_i is computed via the attention mechanism; here the i-th element represents the scores obtained from the inner products of the i-th row of the query matrix Q·W_i^Q with all rows of the key matrix K·W_i^K; these scores are normalized by a Softmax function to obtain weights, and the weighted sum finally yields Attention_i
Step 2.3.4: splicing the output of the multi-head attention mechanism;
all attention matrices are concatenated column-wise into one larger matrix, and the final output is obtained by a linear transformation with the weight matrix, as follows:
Y = Multi(Q, K, V) = concat(Attention_1, Attention_2, ..., Attention_8)·W
where Multi(Q, K, V) denotes the multi-head attention computation over the input query matrix Q, key matrix K, and value matrix V; concat(Attention_1, ..., Attention_8) denotes concatenating the attention matrices produced by the heads column-wise into one larger matrix; W denotes the weight matrix; multiplying the concatenated multi-head attention matrix by the weight matrix yields the weighted and summed query vectors;
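The multi-head computation and concatenation of steps 2.3.3-2.3.4 can be sketched as follows (toy sizes: eight heads of width 2 over a 16-dimensional input, with random stand-ins for the learned per-head projections and the weight matrix W):

```python
import numpy as np

def multi_head_attention(E, head_projs, W_o):
    """Concatenate per-head attention outputs column-wise, then apply W_o."""
    outs = []
    for W_q, W_k, W_v in head_projs:
        Q, K, V = E @ W_q, E @ W_k, E @ W_v
        s = Q @ K.T / np.sqrt(K.shape[-1])
        a = np.exp(s - s.max(axis=-1, keepdims=True))
        a /= a.sum(axis=-1, keepdims=True)   # row-wise softmax per head
        outs.append(a @ V)
    # concat(Attention_1, ..., Attention_h) . W
    return np.concatenate(outs, axis=-1) @ W_o

rng = np.random.default_rng(1)
E = rng.normal(size=(4, 16))
# 8 heads, each projecting the 16-dim input down to 2 dims (8 * 2 = 16).
heads = [tuple(rng.normal(size=(16, 2)) for _ in range(3)) for _ in range(8)]
W_o = rng.normal(size=(16, 16))
Y = multi_head_attention(E, heads, W_o)
```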
step 2.3.5: applying layer normalization to the weighted and summed query vectors; this comprises two steps: first, a translation operation, i.e., adding a bias term to the linearly transformed result; second, a scaling operation, i.e., dividing by a standard deviation;
Finally multiplying the normalized vector with a value matrix to obtain an attention score, and multiplying the attention score with the value matrix to obtain a final output vector;
step 2.3.6: residual connection is carried out on the output vector and the input embedded vector, and the output vector after residual connection is used as the input of the next step;
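Under the standard Transformer formulation, the layer normalization of step 2.3.5 and the residual connection of step 2.3.6 amount to an add-and-norm block; a minimal sketch (the gain gamma and bias beta default to 1 and 0 here, whereas the real model learns them):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each row to zero mean / unit variance, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sd + eps) + beta

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer_out)

rng = np.random.default_rng(2)
x = rng.normal(size=(4, 8))                 # input embedding vectors
y = add_and_norm(x, rng.normal(size=(4, 8)))  # sub-layer output added back
```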
step 3: constructing a ConvLSTM neural network architecture;
a convolutional neural network (CNN), a long short-term memory network (LSTM), and an external attention mechanism are combined to construct the ConvLSTM neural network architecture; the residual-connected embedding vectors from step 2 serve as the input to the ConvLSTM architecture, which is then trained;
step 3.1: establishing a plurality of one-dimensional convolutional neural networks CNN for extracting local features;
the CNN local feature extraction formula is as follows:
p(i) = f( Σ_{x,y} w_{x,y}^{(i)} · c_{x,y} + b_i )
where p(i) is the value of the i-th node in the output unit matrix; w_{x,y}^{(i)} is the weight connecting filter input node (x, y) to the i-th node of the output unit matrix; b_i is the bias term corresponding to the i-th output node; c_{x,y} is the value of node (x, y) in the filter window; and f is the activation function; the vector of all p(i) values is the feature map obtained from the convolutional layer, denoted p, which is the input to the LSTM network architecture;
Step 3.2: after the CNN network is built, an LSTM network architecture is further introduced to process the long sequences; the LSTM uses gating units to maintain an up-to-date memory of the stored API call sequence, comprising forget gate, input gate, and output gate components; the gating unit decides which gate to use according to the sequence extracted by the CNN and adjusts the triggering of each gate with a sigmoid activation function;
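The gating unit of step 3.2 follows the standard LSTM cell equations; a minimal sketch of one time step (the weight matrices are random stand-ins for learned parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step with forget (f), input (i) and output (o) gates."""
    n = h.shape[0]
    z = W @ x + U @ h + b                 # all four gate pre-activations stacked
    f = sigmoid(z[0:n])                   # forget gate: what to drop from c
    i = sigmoid(z[n:2 * n])               # input gate: what to write to c
    o = sigmoid(z[2 * n:3 * n])           # output gate: what to expose as h
    g = np.tanh(z[3 * n:4 * n])           # candidate cell update
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(3)
n_in, n_hid = 6, 4
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
```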
step 3.3: adding an external attention mechanism that weights the output value of each time step, reinforcing the weight of useful information; the CNN, LSTM, and external attention mechanism are fused to construct the final ConvLSTM neural network architecture, and a softmax function is applied to complete the detection and classification of malware;
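The per-time-step weighting of step 3.3 can be sketched as a score vector softmaxed over the LSTM's time steps (the score vector here is a random stand-in for a learned parameter):

```python
import numpy as np

def attention_pool(H, w):
    """Weight each time step (row of H) by score vector w, then pool."""
    scores = H @ w
    a = np.exp(scores - scores.max())
    a /= a.sum()                          # softmax over time steps
    return a @ H, a                       # weighted sum of hidden states

rng = np.random.default_rng(4)
H = rng.normal(size=(10, 4))              # 10 time steps of 4-dim LSTM output
pooled, a = attention_pool(H, rng.normal(size=4))
```

The pooled vector would then feed the final softmax classification layer.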
step 4: training and optimizing ConvLSTM neural network architecture;
step 4.1: adding cross-validation to optimize the ConvLSTM neural network architecture; first, the word-vectorized API function call dataset is randomly divided into a training set and a test set, and a cross-validation method is added; in the cross-validation process, the API function call sequence dataset is first randomly divided into several equally sized subsets; in each round, one subset serves as the validation set while the remaining subsets are used to train the model, which is then tested on the held-out subset; this process is repeated several times, selecting a different subset as the test set each time, finally yielding model performance indices measured on the different subsets;
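The fold construction described in step 4.1 can be sketched as follows (a simple index-interleaving scheme; a real pipeline might shuffle the indices first):

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    # Fold i takes every k-th index starting at i, so folds are disjoint
    # and together cover all n_samples indices.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

splits = list(k_fold_indices(10, 5))
```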
Step 4.2: EDA data enhancement method is added for the data set; EDA generates new training data by randomly transforming the original API sequence data; multiple transformations were performed on each sample, specifically using the following three transformations:
step 4.2.1: random insertion; a position is randomly selected in an API call sequence, and an automatically generated API function is inserted at that position;
step 4.2.2: random deletion; a function is randomly selected from the API call sequence and deleted from the sequence;
step 4.2.3: random swap; two adjacent API functions are randomly selected and their positions exchanged;
step 4.2.4: generating and storing a new data set; mixing the transformed sample with the original sample to form a new data set to be stored in a file;
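The three EDA transforms of steps 4.2.1-4.2.3 can be sketched as follows (the placeholder name `GeneratedApi` and the sample API names are hypothetical stand-ins):

```python
import random

def eda_transform(seq, rng, generated_api="GeneratedApi"):
    """Apply one random EDA transform: insert, delete, or adjacent swap."""
    s = list(seq)                         # leave the original sample intact
    op = rng.choice(["insert", "delete", "swap"])
    if op == "insert":
        # Insert an automatically generated API function at a random position.
        s.insert(rng.randrange(len(s) + 1), generated_api)
    elif op == "delete" and len(s) > 1:
        del s[rng.randrange(len(s))]      # drop one randomly chosen function
    elif op == "swap" and len(s) > 1:
        i = rng.randrange(len(s) - 1)     # swap two adjacent functions
        s[i], s[i + 1] = s[i + 1], s[i]
    return s

rng = random.Random(42)
base = ["CreateFileA", "WriteFile", "RegSetValueA", "CloseHandle"]
augmented = [eda_transform(base, rng) for _ in range(5)]
```

The transformed samples would then be mixed with the originals to form the new dataset of step 4.2.4.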
step 5: evaluating the ConvLSTM neural network architecture optimized in the step 4, and testing the efficiency and accuracy of malware detection;
three evaluation indexes for classifying problems are selected to evaluate the ConvLSTM neural network architecture, specifically, accuracy, F1 score and loss value;
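The accuracy and F1 indices of step 5 reduce to their usual definitions; a self-contained sketch for binary labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    fp = sum(p == positive and t != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```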
step 6: constructing a malicious software API call sequence detection system, and visualizing a detection result; specifically, a related library used for constructing a visual interface in Python is used for constructing a visual system platform;
Step 6.1: designing a system interface; detecting system requirements and functions according to the API call sequence, and constructing a user interface meeting the system requirements; then, performing interface layout design by using a visualization technology in Python;
step 6.2: adding system functions; the functions include selecting and uploading a file, selecting a model, displaying a prediction result image, and displaying a single-sequence prediction window;
step 6.2.1: implementing the upload-file function; first, a file dialog is created with a Python visualization component and the allowed upload file types are set; an action button is then set up to open the file dialog and obtain the path and name of the selected file; finally, the file's data is visualized with a display function;
step 6.2.2: realizing a model selection function; firstly, placing a trained model under a designated folder, displaying the name of the model by using a drop-down frame component, and completing model selection by mouse selection;
step 6.2.3: implementing the prediction result display window; after the dataset and model are selected, a prediction result display window and a display button are created with the visualization component, and the model loss value and accuracy are obtained so that clicking the button displays them in the window;
Step 6.2.4: the function of a prediction result image display window is realized; creating a predicted result image display window by using a visualization component, creating a display result image button, and acquiring a loss value and an accuracy image, so that clicking the result image button displays the loss value and the accuracy image;
step 6.2.5: implementing the single-sequence prediction window; a single-sequence prediction display window and a display button are created with the visualization component, and the single-sequence prediction is obtained so that clicking the button displays the sequence's corresponding label and malicious family category.
The invention has the beneficial technical effects that:
the method analyzes the importance degree and actual requirement of the context-based semantic analysis on the detection of the malicious code API call sequence, and knows that the API call sequence is irrelevant to a specific virus form and an execution environment and has great universality; secondly, according to the related technology based on semantic analysis models and malicious code detection in the past, comparing the advantages and disadvantages of different semantic analysis models, and providing a detection model based on context semantic analysis and bidirectional coding characterization; on the basis, constructing a ConvLSTM model to finish the detection of the API call sequence of the malicious code; finally, a malicious code API call sequence detection system is built by using a PyQt technology, and a more visual malicious code detection effect is provided.
Drawings
FIG. 1 is a flowchart of a method for malware detection based on semantic analysis and bi-directional coding characterization in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of an input sequence for BERT construction according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an API call sequence for vectorization of BERT generation words in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of a multi-head attention mechanism added by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a ConvLSTM neural network architecture according to an embodiment of the present invention;
FIG. 6 is a diagram of a malware API call sequence detection system interface in accordance with an embodiment of the present invention;
FIG. 7 is a single sequence prediction window interface diagram of a malware API call sequence detection system in accordance with an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and examples;
the overall flow of the malware detection method based on semantic analysis and bidirectional coding characterization is shown in fig. 1; the method comprises the following steps:
step 1: acquiring a malicious software data set, and extracting an API function call sequence in the data set;
first, a malware dataset relating to API function call sequences is downloaded from a malware repository; the dataset contains basic information for multiple malware samples, each record including features such as the sha256 hash value, label, header information, import function library, export function library, section information, string information, sliding-window entropy, linker version, submission size, system version, and subsystem version; the import function library contains a large number of malware API functions;
After the data set is downloaded and acquired, extracting the API function of each malicious software from the import function library by using a Python third party library, and simultaneously reserving the sequence of the API function in the import function library to obtain a sequence composed of the API functions, namely, one malicious software corresponds to one API function call sequence, and finally saving two fields of the family name of the malicious software and the corresponding API function call sequence into a CSV file;
step 2: and carrying out word vectorization on the API function call sequence by adopting the BERT model to generate the feature of the word embedding type.
As shown in fig. 3;
the BERT model consists of multiple Transformer layers; based on bidirectional coding characterization, the BERT model can treat the API function call sequence of each malicious sample as a text statement with contextual semantics;
step 2.1: firstly, carrying out Unicode standardization on a malicious software API call sequence in a CSV file, then carrying out token operation on the API call sequence, dividing the sequence into single characters or some combined characters, and then adopting a word segmentation algorithm to segment the text after token;
step 2.2: constructing the input sequence of the BERT model, as shown in figure 2; for the BERT model to understand the incoming malicious-code API call sequence, special tags must be added to the sequence, including [CLS] and [SEP], where the [CLS] tag indicates the beginning of the sequence and the [SEP] tag separates different sentences or paragraphs; the tagged sequence is then converted into 768-dimensional embedding vectors, and a position encoding is added to each embedding vector to represent the position of each API function in the call sequence; adding position encodings gives the subsequent Transformer model the ability to learn word order, i.e., to capture the relative positions of the API functions in the input sequence; a special [MASK] tag is then used to randomly replace the vector values corresponding to particular API functions in the position-encoded embedding vectors; finally, all the embedding vectors are grouped into batches and fed into the Transformer model for further processing;
Step 2.3: adding a multi-headed attention mechanism to the embedded vector as shown in fig. 4;
step 2.3.1: generating a transformation matrix based on the embedded vector;
BERT divides the tagged embedding vectors into h parts, i.e., h attention heads, each of which computes an attention weight matrix that characterizes the correlation between different words in the input. Specifically, each attention head uses three linear transformations (for the query vector, key vector, and value vector) to generate a query matrix Q, a key matrix K, and a value matrix V, as follows:
Q = E·W_Q, K = E·W_K, V = E·W_V
where W_Q, W_K, and W_V are the linear transformation matrices for the respective vectors, and E is the vector matrix formed by concatenating all input vectors.
Step 2.3.2: a single attention mechanism is calculated.
Then, the vector matrix E is used in the computation with the query, key, and value matrices, yielding the self-attention output, with the formula:
Attention(Q, K, V) = Softmax(Q·K^T / √d_k)·V
where Q·K^T is the similarity score between the query matrix Q and all key vectors; the scaling factor √d_k prevents the Q·K^T products from becoming too large; and the Softmax function normalizes each row vector after the operation so as to compute the importance of each word to the other words.
Step 2.3.3: a multi-headed attention mechanism is calculated.
After calculating the output of a single self-attention mechanism, the output of a multi-head attention mechanism is obtained as follows:
Attention_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
Specifically, the attention mechanism computes a weighted average, namely Attention_i. The i-th element represents the score obtained from the inner product of the i-th row of the query matrix QW_i^Q with all rows of the key matrix KW_i^K; these scores are normalized by a Softmax function to obtain weights, and a weighted sum with these weights finally yields Attention_i.
Step 2.3.4: output of spliced multi-head attention mechanism
Finally, all the Attention matrixes are connected together according to columns to form a larger matrix, and the final output is obtained by linear transformation of the weight matrix, wherein the formula is as follows:
Y = Multi(Q, K, V) = concat(Attention_1, Attention_2, ..., Attention_8)·W
wherein Multi(Q, K, V) represents performing the multi-head attention calculation on the input query matrix Q, key matrix K and value matrix V; concat(·) represents connecting the Attention matrices obtained from the multiple heads by columns to form a larger matrix; W represents the weight matrix, and the final output matrix Y is obtained by applying this linear transformation to the concatenated matrix.
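The multi-head computation and column-wise concatenation of steps 2.3.3 and 2.3.4 can be sketched as follows. The head count, dimensions and random projection matrices are toy assumptions for illustration only:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(E, heads=8, d_model=64, seed=0):
    """Y = concat(Attention_1, ..., Attention_h) . W, per the formula above."""
    rng = np.random.default_rng(seed)
    d_k = d_model // heads
    outputs = []
    for _ in range(heads):
        W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        Q, K, V = E @ W_Q, E @ W_K, E @ W_V
        scores = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
        outputs.append(scores @ V)                  # one Attention_i per head
    W = rng.standard_normal((heads * d_k, d_model))
    return np.concatenate(outputs, axis=-1) @ W     # column concat, then linear map

E = np.random.default_rng(1).standard_normal((5, 64))   # 5 API tokens
Y = multi_head_attention(E)
```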
Step 2.3.5: layer standardization treatment;
Layer normalization is then applied to the weighted and summed query vectors; this process is divided into two steps: the first step is a translation operation, i.e. adding a bias term to the linearly transformed result, to strengthen the expressive capacity of the model; the second step is a scaling operation, i.e. dividing by a standard deviation, so that the outputs have the same variance; this process improves the convergence speed and stability of the model, allowing it to better cope with different input data;
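The scaling and shifting described in step 2.3.5 correspond to standard layer normalization: each vector is normalized to zero mean and unit variance and then rescaled and shifted by learnable parameters. A minimal numpy sketch (with scalar gamma/beta standing in for the learned parameters):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each row to zero mean / unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x)
```

The small `eps` term guards against division by zero when a row has near-zero variance.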
the normalized vector is finally multiplied by the value matrix to obtain an attention score, and the attention score is multiplied by the value matrix to obtain the final output vector;
step 2.3.6: residual connection;
The output vector is residual-connected with the input embedding, so that the model can better learn information at different levels and aspects of the input. Normalization operations, including batch normalization and residual normalization, are performed after the residual connection to improve the stability and training effect of the BERT model.
Step 3: constructing ConvLSTM neural network architecture.
In the present design, the BERT model is combined with the ConvLSTM architecture to realize incremental fine-tuning; fig. 5 shows a schematic diagram of the ConvLSTM neural network architecture.
Step 3.1: a plurality of one-dimensional convolutional neural networks (CNNs) are established for extracting local features. On the one hand, multiple convolution kernels fully extract features with stronger discrimination; on the other hand, the parameters of the convolution layer are reduced and the running time is shortened.
The API function call sequence of malware may be regarded as a sequence of operational instructions, which can be processed using network models from text classification and sentiment analysis. A CNN can summarize local feature predictors in a given structure and combine them to generate a feature matrix representing that structure; it extracts local features of different sizes by setting different filter kernel sizes. The output vector matrix T of the BERT layer is used as the input to the CNN, and the convolution kernels slide over the sentence word-vector matrix. Each convolution kernel is multiplied element-wise with the corresponding window of the sentence word-vector matrix and the products are summed; these values are used as the eigenvalues of the final eigenvector matrix, and all eigenvalues form the feature map.
The present design uses a multi-kernel approach with three filters of sizes 2, 3 and 4 to fuse the convolution layers, and the number of convolution kernels for each size is set to 128 in order to extract different text features. The filter converts the 3 x 1 node matrix into a unit node matrix. The formula of the CNN for local feature extraction is as follows:
p(i) = f( Σ_{x,y} w_{x,y}^i · c_{x,y} + b_i )
wherein p(i) represents the value of the i-th node in the unit matrix; w_{x,y}^i represents the weight of the filter input node (x, y) for the i-th node in the output unit node matrix, and b_i represents the bias term parameter corresponding to the i-th output node; c_{x,y} is the value of the node (x, y) in the filter; f is the activation function. The unit vector of all p(i) is the feature map obtained from the convolutional layer, denoted as p, which is the input of the next layer.
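The multi-kernel sliding-window extraction described above can be sketched with numpy. The filter counts, kernel sizes and random weights are toy stand-ins (the design itself uses 128 filters per size over 768-dimensional BERT vectors); ReLU and max-over-time pooling are common choices assumed here for the activation f and the pooling step:

```python
import numpy as np

def conv1d_features(T, kernel_sizes=(2, 3, 4), n_filters=4, seed=0):
    """Slide 1-D filters of several widths over the token matrix T (len, dim)."""
    rng = np.random.default_rng(seed)
    seq_len, dim = T.shape
    pooled = []
    for k in kernel_sizes:
        W = rng.standard_normal((n_filters, k, dim))
        b = rng.standard_normal(n_filters)
        # p(i): sum of element-wise products of window and filter, plus bias
        feat = np.array([[(T[i:i + k] * W[f]).sum() + b[f]
                          for i in range(seq_len - k + 1)]
                         for f in range(n_filters)])
        feat = np.maximum(feat, 0.0)          # ReLU as the activation f
        pooled.append(feat.max(axis=1))       # max-over-time pooling per filter
    return np.concatenate(pooled)             # fused multi-kernel feature vector

T = np.random.default_rng(1).standard_normal((10, 8))   # 10 tokens, toy dim 8
p = conv1d_features(T)
```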
Step 3.2: after the CNN network is built, an LSTM network architecture is further introduced to process long sequences, alleviating the gradient vanishing and gradient explosion problems of traditional recurrent neural networks. Unlike ordinary neurons, LSTM uses gating units, comprising a forget gate, an input gate and an output gate, to store the latest memory of the API call sequence. The gating unit decides which gate to use according to the sequence extracted by the CNN, and adjusts the triggering of each gate with a sigmoid activation function, thereby realizing precise control of the data and information flow inside the unit.
In the present design, after the LSTM network is added, the weights and biases of the LSTM are initialized randomly. For each time step, the LSTM network first computes the input gate to determine which information should be passed into the LSTM cell. Next, the forget gate determines which historical information should be retained. Then, by combining the forget gate with the previous cell state and adding the gated input of the current time step, a new memory cell state is obtained; this state is preserved and passed on to the next time step. Finally, the LSTM computes the output gate to determine the output of the time step: the memory cell state is passed through a tanh activation function and combined with the output gate to obtain the new output. During parameter tuning, the difference between the predicted output and the actual output is first computed, and the weights and biases of the LSTM are updated using the back-propagation algorithm. In this way, the LSTM model gradually learns the patterns of the API call sequence data.
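One LSTM time step, with the three gates described above, can be sketched in numpy. Dimensions and random weights are illustrative; the stacked weight layout (input, forget, output, candidate) is a common convention assumed here, not taken from the patent:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: input, forget and output gates plus candidate cell."""
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # stacked pre-activations, shape (4d,)
    i = sigmoid(z[:d])                  # input gate
    f = sigmoid(z[d:2 * d])             # forget gate
    o = sigmoid(z[2 * d:3 * d])         # output gate
    g = np.tanh(z[3 * d:])              # candidate memory
    c = f * c_prev + i * g              # new memory cell state
    h = o * np.tanh(c)                  # new hidden output for this time step
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 6, 4
W = rng.standard_normal((4 * d_h, d_in))
U = rng.standard_normal((4 * d_h, d_h))
b = np.zeros(4 * d_h)
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
```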
Step 3.3: an external attention mechanism is added after the LSTM architecture, weighting the output value of each time step and reinforcing the weight of useful information. Finally, the softmax function is applied to complete the detection and classification of the malware.
Step 4: the ConvLSTM neural network architecture is optimized.
Step 4.1: adding cross-validation to optimize the ConvLSTM neural network architecture. First, the word-vectorized API function call data set is randomly divided into a training set and a test set in an 8:2 ratio, and a ten-fold cross-validation method is added to improve the performance of the ConvLSTM neural network architecture. In the cross-validation process, the API function call sequence data set is first randomly divided into 10 subsets of equal size; in each round, one subset serves as the validation set and the remaining subsets serve as the training set for the ConvLSTM neural network architecture, which is then tested on the held-out subset. This process is repeated 10 times, selecting a different subset as the test set each time, resulting in 10 performance measurements of the ConvLSTM neural network architecture.
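The ten-fold splitting procedure can be sketched in pure Python (the shuffle seed and sample count are arbitrary for the example):

```python
import random

def k_fold_indices(n_samples, k=10, seed=42):
    """Randomly partition sample indices into k equal-sized folds and yield
    (train, validation) index lists, one round per fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]                                       # held-out fold
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

splits = list(k_fold_indices(100))
```

Each round trains the model on `train` and evaluates on `val`; averaging the 10 scores gives the cross-validated performance estimate.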
Step 4.2: adding the EDA data enhancement method. EDA generates new training data by randomly transforming the original API sequence data. Each sample is transformed multiple times; the present design uses the following three transforms.
Step 4.2.1: random insertion. An API function is randomly selected from an API call sequence, and an automatically generated API function is inserted at that location.
Step 4.2.2: random deletion. A function is randomly selected from the API call sequence and deleted from it.
Step 4.2.3: random exchange. Two adjacent API functions are randomly selected and their positions are swapped.
Step 4.2.4: a new data set is generated and stored. And mixing the transformed samples with the original samples to form a new data set, and storing the new data set in a file.
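The three EDA transforms of steps 4.2.1-4.2.3 and the mixing of step 4.2.4 can be sketched as follows; the API names and the `"GeneratedApi"` placeholder for an automatically generated function are hypothetical:

```python
import random

def eda_augment(seq, op, rng):
    """Apply one of the three EDA transforms to an API call sequence."""
    seq = list(seq)
    if op == "insert":                        # random insertion
        pos = rng.randrange(len(seq) + 1)
        seq.insert(pos, "GeneratedApi")       # stand-in for an auto-generated API
    elif op == "delete" and len(seq) > 1:     # random deletion
        seq.pop(rng.randrange(len(seq)))
    elif op == "swap" and len(seq) > 1:       # swap two adjacent functions
        i = rng.randrange(len(seq) - 1)
        seq[i], seq[i + 1] = seq[i + 1], seq[i]
    return seq

rng = random.Random(0)
base = ["LoadLibraryA", "GetProcAddress", "VirtualAlloc", "WriteProcessMemory"]
augmented = [eda_augment(base, op, rng) for op in ("insert", "delete", "swap")]
dataset = [base] + augmented        # mix original and transformed samples
```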
Step 5: and evaluating the ConvLSTM neural network architecture, and testing the efficiency and accuracy of malware detection.
The present design selects the following three evaluation indexes for classification problems: accuracy, F1 score and loss value. The following quantities are defined: TP (true positive), FN (false negative), FP (false positive) and TN (true negative).
Accuracy is an evaluation index in classification tasks and is used for measuring the Accuracy degree of model prediction. It represents the ratio of the number of samples correctly predicted by the model to the total number of samples.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The F1 score is a comprehensive evaluation index for classification tasks, combining the Precision and Recall of the model. Precision refers to how many of the samples predicted as positive by the model are truly positive, and Recall refers to how many of the actually positive samples are correctly predicted by the model. The F1 score is the harmonic mean of these two indexes and is calculated as follows:
F1 = 2 · Precision · Recall / (Precision + Recall)
For the loss function, the cross-entropy loss function is used. It measures the performance of the classification model and represents the difference between the true sample label and the predicted probability; the smaller the difference, the better the prediction. The cross entropy H(p, q) of the probability distribution p relative to the probability distribution q is calculated as follows:
H(p, q) = -Σ_x p(x)·log q(x)
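The three evaluation indexes can be computed directly from the TP/TN/FP/FN counts and the predicted distribution; the counts used in the example below are arbitrary illustrative numbers, not the experimental results reported in Table 1:

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of samples predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log q(x); eps guards against log(0)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

acc = accuracy(tp=90, tn=85, fp=10, fn=15)       # 175 / 200 = 0.875
f1 = f1_score(tp=90, fp=10, fn=15)
loss = cross_entropy([1.0, 0.0], [0.9, 0.1])     # -log 0.9 for a one-hot label
```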
in order to verify the effectiveness of the design method of the invention, different classification models are compared in the same experimental environment, and the final results are shown in table 1.
Table 1 model evaluation result diagram
Model 1 is the result of connecting the output of BERT directly to a fully connected layer, which is then input to a Softmax classifier. Comparison with model 2 shows that after 10-fold cross-validation the accuracy of the model improves considerably, indicating that cross-validation is necessary. Under the same detection classifier framework, the BERT model can resolve the representation of ambiguous words, compared with static embedding models such as Word2Vec and Embedding; it uses a Transformer model with strong text-fusion capability as the substructure of the pre-trained model, greatly improving analysis capability. Compared with the single feature extraction of BERT alone, the hybrid model used in this method extracts text features better, achieves higher classification precision and improves all evaluation indexes. The final experimental results show that the proposed BERT-based ConvLSTM model is the best in terms of accuracy and loss function value, with an accuracy of 98.81% and a loss value of 0.03.
Step 6: and constructing a malicious software API call sequence detection system, and visualizing a detection result. The invention designs a system construction by using a PyQt5 library, and an interface of a malicious software API call sequence detection system is shown in FIG. 6.
Step 6.1: designing the system interface and selecting suitable layouts and controls. According to the requirements and functions of the API call sequence detection system, suitable PyQt controls such as QComboBox, QPushButton and QLabel are selected to help build a user interface meeting the system requirements. The PyQt framework is then used for interface layout design: QWidget is used to create the main window, and the other controls are placed inside it. The window is divided into different areas according to the system requirements and functions, and the corresponding controls are placed in each area. Event handlers are then added to the PyQt controls so that they can respond to user operations and realize the corresponding interactive functions. Finally, after the interface design is completed, initial testing and layout optimization of the visual interface are carried out.
Step 6.2: system functions are added. Mainly adding a selected uploading file, model selection, a predicted result display window, a predicted result image display window, a single-sequence predicted window and the like.
Step 6.2.1: realizing the file upload function. First, the QFileDialog module is imported into the Python code; QFileDialog is then used to create a file dialog in which the CSV file to be uploaded is selected; the getOpenFileName() method opens the file dialog and obtains the path and name of the selected file; finally, the API function call sequence of the file is visualized using a display function.
Step 6.2.2: the model selection function is implemented. Firstly, putting the trained models under a designated folder, then displaying the names of the models by using a drop-down frame in PyQt, and completing model selection through mouse selection.
Step 6.2.3: realizing the prediction result display window function. After the data set and model are selected, QWidget is used to create a prediction result display window; signals and slots connect the 'prediction result display' button with the corresponding result display function, and the prediction result is displayed. The name of the current model, the data size, the loss value on the test set and the accuracy can be obtained.
Step 6.2.4: realizing the prediction result image display window function. QWidget is likewise used to create a prediction result image display window, and a response event is set so that the test set accuracy curve and loss value curve of the current model are obtained after clicking the 'image display' button.
Step 6.2.5: realizing the single-sequence prediction window function; fig. 7 shows the single-sequence prediction window interface. The same file upload function as in step 6.2.1 is created; signals and slots connect the 'predict this line' event with the corresponding slot function to select a single sequence and display the API functions it contains; finally, the model selected in step 6.2.2 detects the single API sequence and gives the corresponding label and malicious family category.

Claims (8)

1. The malicious software detection method based on semantic analysis and bidirectional coding characterization is characterized by comprising the following steps of:
step 1: acquiring a malicious software data set, storing the malicious software data set in a CSV file form, and extracting an API function call sequence in the data set;
step 2: performing word vectorization on the API function call sequence obtained in step 1 by using a BERT model, thereby generating word-embedding features;
step 3: constructing a ConvLSTM neural network architecture;
step 4: training and optimizing ConvLSTM neural network architecture;
step 5: evaluating the ConvLSTM neural network architecture optimized in the step 4, and testing the efficiency and accuracy of malware detection;
three evaluation indexes for classifying problems are selected to evaluate the ConvLSTM neural network architecture, specifically, accuracy, F1 score and loss value;
Step 6: constructing a malicious software API call sequence detection system, and visualizing a detection result; and building a visual system platform by using a related library used for building a visual interface in Python.
2. The method for detecting malicious software based on semantic analysis and bi-directional coding characterization according to claim 1, wherein step 1 is specifically:
first, a malware dataset is downloaded, the dataset containing basic information for a plurality of malware, each basic information comprising the following features: sha256 hash value, label, header information, import function library, export function library, section information, character string information, sliding window entropy calculation, linker version, submission size, system version and subsystem version; wherein the imported function library contains a malicious software API function;
after the data set is obtained, aiming at basic information of each piece of malicious software, using a Python third party library to extract the API function of each piece of malicious software from the import function library, and simultaneously reserving the sequence of the API function in the import function library to obtain a sequence composed of the API functions, namely, one piece of malicious software corresponds to one piece of API function call sequence, and finally saving two fields of the family name of the malicious software and the corresponding API function call sequence into a CSV file.
3. The method for detecting malware based on semantic analysis and bi-directional coding characterization according to claim 1, wherein the BERT model in step 2 is composed of a plurality of Transformer layers, and the BERT model based on bi-directional coding characterization treats the API function call sequence of each malicious sample as a text sentence with contextual semantics; the step 2 is specifically as follows:
step 2.1: first performing Unicode normalization on the malware API call sequence in the CSV file, then tokenizing the API call sequence, dividing it into single characters or combined characters, and then applying a word segmentation algorithm to the tokenized text;
step 2.2: constructing an input sequence of the BERT model; adding special tags to the malware API call sequence, the tags including [ CLS ] and [ SEP ], wherein the [ CLS ] tag represents the beginning of the sequence, and the [ SEP ] tag is used for separating different sentences or paragraphs; then converting the tagged sequence into 768-dimensional embedded vectors and adding a position code to each embedded vector to represent the position of each API function in the call sequence; then using a special [ MASK ] tag to randomly replace the vector values corresponding to specific API functions in the position-encoded embedded vectors; finally, all the embedded vectors are grouped into batches and fed into the Transformer model for further processing;
Step 2.3: a multi-head attention mechanism is added to all the embedded vectors obtained in step 2.2.
4. A method for malware detection based on semantic analysis and bi-directional coding characterization according to claim 3, wherein step 2.3 is specifically:
step 2.3.1: generating a transformation matrix based on the embedded vector;
BERT divides the tagged embedded vectors into h parts, namely h attention heads; each attention head calculates an attention weight matrix characterizing the correlation between different words in the input; specifically, each attention head generates a query matrix Q, a key matrix K and a value matrix V through three linear transformations of the query vector, the key vector and the value vector, specifically:
Q = E·W_Q, K = E·W_K, V = E·W_V
wherein W_Q, W_K and W_V are the linear transformation matrices of the respective vectors, and E is the vector matrix formed by concatenating all input vectors;
step 2.3.2: calculating 1 part of embedded vector, namely a single attention mechanism corresponding to 1 attention head;
the query matrix Q, key matrix K and value matrix V obtained from the vector matrix E are used for calculation, so that a single self-attention output is obtained, specifically:
Attention(Q, K, V) = Softmax(QK^T/√d_k)·V
wherein QK^T is the similarity score between the query matrix Q and the transposed key matrix K^T; the scaling factor √d_k prevents the values of QK^T from becoming too large; the Softmax function normalizes each row vector after the operation and calculates the importance of each word to the other words;
step 2.3.3: calculating a multi-head attention mechanism of all embedded vectors;
after calculating the output of a single self-attention mechanism, the output of a multi-head attention mechanism is obtained as follows:
Attention_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
that is, a weighted average, namely Attention_i, is calculated by the attention mechanism; the i-th element represents the score obtained from the inner product of the i-th row of the query matrix QW_i^Q with all rows of the key matrix KW_i^K; the scores are normalized by a Softmax function to obtain weights, and a weighted sum with these weights finally yields Attention_i;
Step 2.3.4: splicing the output of the multi-head attention mechanism;
all the Attention matrixes are connected together according to columns to form a larger matrix, and the final output is obtained by linear transformation of the weight matrix, wherein the formula is as follows:
Y = Multi(Q, K, V) = concat(Attention_1, Attention_2, ..., Attention_8)·W
wherein Multi(Q, K, V) represents performing the multi-head attention calculation on the input query matrix Q, key matrix K and value matrix V; concat(·) represents connecting the Attention matrices obtained from the multiple heads by columns to form a larger matrix; W represents the weight matrix; multiplying the concatenated multi-head Attention matrix by the weight matrix yields the weighted and summed query vector;
Step 2.3.5: carrying out layer standardization processing on the weighted and summed query vectors; the method comprises the following two steps: the first step is to perform translation operation, namely adding a bias term bias to the result after linear transformation; the second step is to perform scaling operation, i.e. dividing by one standard deviation;
finally multiplying the normalized vector with a value matrix to obtain an attention score, and multiplying the attention score with the value matrix to obtain a final output vector;
step 2.3.6: and carrying out residual connection on the output vector and the input embedded vector, wherein the output vector after residual connection is used as the input of the next step.
5. The method for detecting malicious software based on semantic analysis and bi-directional coding characterization according to claim 1, wherein the constructing a ConvLSTM neural network architecture in step 3 specifically comprises:
combining a convolutional neural network CNN, a long and short-term memory network LSTM and an external attention mechanism to construct a ConvLSTM neural network architecture, taking the embedded vector after the residual connection in the step 2 as the input of the ConvLSTM architecture, and training the ConvLSTM architecture;
step 3.1: establishing a plurality of one-dimensional convolutional neural networks CNN for extracting local features;
the formula of CNN for local feature extraction is as follows:
p(i) = f( Σ_{x,y} w_{x,y}^i · c_{x,y} + b_i )
wherein p(i) represents the value of the i-th node in the unit matrix; w_{x,y}^i represents the weight of the filter input node (x, y) for the i-th node in the output unit node matrix, and b_i represents the bias term parameter corresponding to the i-th output node; c_{x,y} is the value of the node (x, y) in the filter; f is the activation function; the unit vector of all p(i) is the feature map obtained from the convolutional layer, denoted p, which is the input of the LSTM network architecture;
step 3.2: after the CNN network is built, an LSTM network architecture is further introduced to process the long sequence, and the LSTM uses a gating unit to realize the latest memory of the stored API call sequence, wherein the latest memory comprises a forgetting gate, an input gate and an output gate component; the gating unit decides which gate is used according to the sequence extracted by the CNN, and adjusts the triggering of the gate by utilizing an S-shaped activation function;
step 3.3: adding an external attention mechanism, weighting the output value of each time step, and reinforcing the weight of useful information; and finally constructing a ConvLSTM neural network architecture by fusing CNN, LSTM and an external attention mechanism, and completing detection and classification of the malicious software by applying a softmax function.
6. The method for detecting malicious software based on semantic analysis and bi-directional coding characterization according to claim 1, wherein step 4 is specifically:
Step 4.1: adding a cross-validation optimization model aiming at a ConvLSTM neural network architecture; firstly, randomly dividing an API function call data set after word vector into a training set and a testing set, and adding a cross verification method; in the cross validation process, firstly, randomly dividing an API function call sequence data set into a plurality of subsets with equal size, wherein one subset is used as a validation set, and the rest subset is used as a training set; the remaining subsets are then used to train the model for each subset and tests are performed on that subset; this process is repeated a plurality of times, and different subsets are selected as test sets each time, so that model performance indexes tested by using the different subsets are finally obtained;
step 4.2: EDA data enhancement method is added for the data set; EDA generates new training data by randomly transforming the original API sequence data; multiple transformations were performed on each sample, specifically using the following three transformations:
step 4.2.1: randomly inserting; randomly selecting an API function from a certain API call sequence, and inserting an automatically generated API function at the position;
step 4.2.2: randomly deleting; randomly selecting a function from the API call sequence and deleting the function from the function;
Step 4.2.3: random exchange; randomly selecting two adjacent API functions and exchanging their positions;
step 4.2.4: generating and storing a new data set; and mixing the transformed samples with the original samples to form a new data set, and storing the new data set in a file.
7. The method for detecting malicious software based on semantic analysis and bi-directional coding characterization according to claim 1, wherein step 6 is specifically:
step 6.1: designing a system interface; detecting system requirements and functions according to the API call sequence, and constructing a user interface meeting the system requirements; then, performing interface layout design by using a visualization technology in Python;
step 6.2: adding system functions; the functions comprise selecting an uploading file, selecting a model, displaying a predicted result image and displaying a single-sequence predicted window.
8. The method for detecting malicious software based on semantic analysis and bi-directional coding characterization according to claim 7, wherein step 6.2 is specifically:
step 6.2.1: selecting an uploading file function to realize; firstly, creating a dialogue box of a file by using a visualization component in python, setting the type of the uploaded file, then setting an action button for opening the dialogue box of the file, acquiring the path and the name of the selected file, and finally visualizing the data of the file by using a presentation function;
Step 6.2.2: realizing a model selection function; firstly, placing a trained model under a designated folder, displaying the name of the model by using a drop-down frame component, and completing model selection by mouse selection;
step 6.2.3: realizing a prediction result display window function; after the data set and the model are selected, a visual component is used for creating a predicted result display window, a predicted result display button is created, and the model loss value and the accuracy are obtained, so that the predicted result button is clicked to obtain the model loss value and the model accuracy and display the model loss value and the model accuracy in the window;
step 6.2.4: the function of a prediction result image display window is realized; creating a predicted result image display window by using a visualization component, creating a display result image button, and acquiring a loss value and an accuracy image, so that clicking the result image button displays the loss value and the accuracy image;
step 6.2.5: realizing a single sequence prediction window function; a visualization component is used to create a single sequence prediction display window and create a display single sequence prediction result display button, and single sequence predictions are obtained so that the single sequence prediction button displays the corresponding tag and malicious family category of the sequence.
CN202310588930.2A 2023-05-24 2023-05-24 Malicious software detection method based on semantic analysis and bidirectional coding characterization Pending CN116432184A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310588930.2A CN116432184A (en) 2023-05-24 2023-05-24 Malicious software detection method based on semantic analysis and bidirectional coding characterization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310588930.2A CN116432184A (en) 2023-05-24 2023-05-24 Malicious software detection method based on semantic analysis and bidirectional coding characterization

Publications (1)

Publication Number Publication Date
CN116432184A true CN116432184A (en) 2023-07-14

Family

ID=87087521

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310588930.2A Pending CN116432184A (en) 2023-05-24 2023-05-24 Malicious software detection method based on semantic analysis and bidirectional coding characterization

Country Status (1)

Country Link
CN (1) CN116432184A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117354067A (en) * 2023-12-06 2024-01-05 南京先维信息技术有限公司 Malicious code detection method and system
CN117354067B (en) * 2023-12-06 2024-02-23 南京先维信息技术有限公司 Malicious code detection method and system
CN117807603A (en) * 2024-02-29 2024-04-02 浙江鹏信信息科技股份有限公司 Software supply chain auditing method, system and computer readable storage medium
CN117807603B (en) * 2024-02-29 2024-04-30 浙江鹏信信息科技股份有限公司 Software supply chain auditing method, system and computer readable storage medium
CN118171273A (en) * 2024-03-11 2024-06-11 北京中科网芯科技有限公司 Malicious code detection method and system
CN118171273B (en) * 2024-03-11 2024-08-09 北京中科网芯科技有限公司 Malicious code detection method and system

Similar Documents

Publication Publication Date Title
CN112487807B Text relation extraction method based on dilated gated convolutional neural network
CN107506414A Code recommendation method based on long short-term memory network
CN110598206A (en) Text semantic recognition method and device, computer equipment and storage medium
CN116432184A (en) Malicious software detection method based on semantic analysis and bidirectional coding characterization
CN110532353 Text entity matching method, system and device based on deep learning
CN113010683B (en) Entity relationship identification method and system based on improved graph attention network
CN111062036A (en) Malicious software identification model construction method, malicious software identification medium and malicious software identification equipment
CN111931935B (en) Network security knowledge extraction method and device based on One-shot learning
CN113742733B Method and device for reading-comprehension-based vulnerability event trigger word extraction and vulnerability type identification
Nowotny Two challenges of correct validation in pattern recognition
CN114297079A XSS fuzz test case generation method based on temporal convolutional network
CN116361788A (en) Binary software vulnerability prediction method based on machine learning
CN115776401A Method and device for tracing network attack events based on few-shot learning
CN112613032B (en) Host intrusion detection method and device based on system call sequence
CN112132269B (en) Model processing method, device, equipment and storage medium
US20210326664A1 (en) System and Method for Improving Classification in Adversarial Machine Learning
Sha et al. Rationalizing predictions by adversarial information calibration
CN117879934A (en) SQL injection attack detection method based on network data packet context
CN117171746A (en) Malicious code homology analysis method and device, electronic equipment and storage medium
CN113822018B (en) Entity relation joint extraction method
US11755570B2 (en) Memory-based neural network for question answering
CN115018627A (en) Credit risk evaluation method and device, storage medium and electronic equipment
Zhang et al. MTSCANet: Multi temporal resolution temporal semantic context aggregation network
Dai et al. [Retracted] Anticoncept Drift Method for Malware Detector Based on Generative Adversarial Network
CN112860573A (en) Smartphone malicious software detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination