CN110472417B - Convolutional neural network-based malicious software operation code analysis method - Google Patents


Publication number
CN110472417B
CN110472417B CN201910776705.5A CN201910776705A CN110472417B CN 110472417 B CN110472417 B CN 110472417B CN 201910776705 A CN201910776705 A CN 201910776705A CN 110472417 B CN110472417 B CN 110472417B
Authority
CN
China
Prior art keywords
vector
operation code
neural network
convolutional neural
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910776705.5A
Other languages
Chinese (zh)
Other versions
CN110472417A (en)
Inventor
陈璨
赵立超
李丹
史闻博
庄宇鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University Qinhuangdao Branch
Original Assignee
Northeastern University Qinhuangdao Branch
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University Qinhuangdao Branch
Priority to CN201910776705.5A
Publication of CN110472417A
Application granted
Publication of CN110472417B
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033 Test or assess software

Abstract

The invention discloses a convolutional neural network-based method for analyzing malicious software operation codes, which comprises the following steps: acquiring Dalvik bytecode; acquiring an operation code sequence and representing it with one-hot vectors; converting the one-hot vectors into fixed-size vectors, multiplying them by a random weight matrix, and inputting the result to a convolutional neural network; outputting a feature map set matrix C in the convolutional layer; in k-max pooling, performing a max-pooling operation on the matrix C, extracting the k most important feature values, and outputting a feature vector Z; forming a fully connected layer from the vector Z and operating on Z in that layer to obtain an output feature y; processing the output feature y with a softmax function to obtain a relative probability distribution p; calculating a cross-entropy loss function $L_k$; adjusting the parameter values of the model step by step with gradient descent so as to minimize the loss function; and iteratively updating the model parameters and optimizing the detection model based on the computed outputs. The invention achieves high detection accuracy.

Description

Convolutional neural network-based malicious software operation code analysis method
Technical Field
The invention relates to the field of malicious software detection, in particular to a malicious software operation code analysis method based on a convolutional neural network.
Background
Existing Android malware detection approaches mainly include HinDroid, a static-analysis method that links applications through meta-paths, and systems that identify Android malware dynamically. These methods suffer from insufficient accuracy, a low classification success rate, low efficiency, and incomplete extraction of the operation code sequence.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a convolutional neural network-based malicious software operation code analysis method which has the characteristic of high detection accuracy.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
A malicious software operation code analysis method based on a convolutional neural network comprises the following steps:
S1, obtaining training samples; each training sample is an executable of software of a known type, the types being benign and malicious;
S2, decompiling the training sample to obtain its smali files, and acquiring Dalvik bytecode from the smali files;
S3, acquiring an operation code sequence file according to the Android operation code constant list, and representing the operation code sequence with one-hot vectors;
S4, converting the one-hot vectors into fixed-size vectors in the embedding layer;
S5, multiplying the converted one-hot vectors by a random weight matrix to generate a matrix M, and inputting the matrix M into the convolutional neural network;
S6, convolving the matrix M with convolution kernels in the convolutional layer to obtain feature maps, and collecting the feature maps of the different convolution kernels into a feature map set matrix C;
S7, in k-max pooling, performing a max-pooling operation on the matrix C and extracting the k most important feature values to obtain the feature vector Z of the hidden layer;
S8, forming a fully connected layer from the fixed-size vector Z, and operating on Z in the fully connected layer to obtain an output feature y; processing y with a softmax function to convert the multi-class output values into relative probabilities, obtaining a relative probability distribution p;
S9, calculating a cross-entropy loss function $L_k$ from the obtained probability distribution p and the label probability distribution;
S10, using the calculated cross-entropy loss $L_k$ to estimate the difference between the model's predicted values and the true values, and optimizing with gradient descent to obtain an optimal Android program detection model;
S11, iteratively updating the model parameters and optimizing the Android program detection model based on the computed outputs and the minimized loss function.
Further, in the operation code sequence represented by one-hot vectors in step S3, each operation code corresponds to one of 256 positions; the position of the operation code that appears is set to 1, and the positions of operation codes that do not appear are set to 0.
Further, in step S7, k is 3 or 5.
Further, in step S8, the vector Z is operated on in the fully connected layer: the weighted sum of the previous layer's outputs is computed, combining the elements to obtain the output feature y; the softmax function then processes the output feature y, acting as the classifier.
Further, in step S10, the loss function is minimized by adjusting the parameter values of the model step by step with a gradient descent optimization method, so as to obtain an optimal Android program detection model.
The beneficial effects of the above technical solution are as follows:
The method uses a convolutional neural network to extract static features of Android software effectively, and uses a static analysis tool to extract the operation code sequence from the software's APK file, which is more complete and efficient than existing methods. The k-max pooling strategy captures the relative position information of the operation codes: it selects the k largest of all feature values while preserving their original order, so weakly informative features are eliminated, important features are retained, and the relative positions of the features are reflected. The classification model is trained with cross-entropy loss, which evaluates the quality of the neural network better than mean squared error.
Drawings
FIG. 1 is a flow chart of the analysis method of the present invention;
FIG. 2 is a flow chart of the APK pre-processing operation of the present invention;
FIG. 3 is a schematic flow chart of the detection system proposed by the present invention;
FIG. 4 is a diagram of the generation of the convolutional neural network input of the present invention;
FIG. 5 is an example of k-max pooling (k = 2) of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, the present invention discloses a method for analyzing malware operation codes based on a convolutional neural network, which comprises the following steps:
S1, obtaining training samples; each training sample is an executable of software of a known type, the types being benign and malicious;
S2, decompiling the training sample with apktool to obtain its smali files, acquiring Dalvik bytecode from the smali files, and discarding the operands; the APK pre-processing procedure is shown in FIG. 2.
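A minimal sketch of the operand-discarding part of step S2, assuming smali instructions sit one per line between `.method` and `.end method`. The `OPCODE_TABLE` below is a hypothetical, heavily truncated stand-in for the Android operation code constant list, and `extract_opcodes` is an illustrative helper, not part of apktool:

```python
# Hypothetical, truncated opcode table (mnemonic -> Dalvik opcode value).
# The full Android opcode constant list has up to 256 entries.
OPCODE_TABLE = {
    "nop": 0x00,
    "move": 0x01,
    "return-void": 0x0E,
    "const/4": 0x12,
    "invoke-virtual": 0x6E,
    "invoke-direct": 0x70,
}


def extract_opcodes(smali_text, table=OPCODE_TABLE):
    """Return the opcode values found inside method bodies, dropping operands."""
    ids = []
    in_method = False
    for raw in smali_text.splitlines():
        line = raw.strip()
        if line.startswith(".method"):
            in_method = True
        elif line.startswith(".end method"):
            in_method = False
        elif in_method and line and not line.startswith((".", "#", ":")):
            mnemonic = line.split()[0]      # first token is the opcode mnemonic
            if mnemonic in table:           # unknown mnemonics are skipped here
                ids.append(table[mnemonic])
    return ids


SAMPLE = """\
.class public LDemo;
.method public constructor <init>()V
    .locals 1
    invoke-direct {p0}, Ljava/lang/Object;-><init>()V
    const/4 v0, 0x0
    return-void
.end method
"""

print(extract_opcodes(SAMPLE))
```

Directives (lines starting with `.`), comments, and labels are skipped so that only instruction mnemonics contribute to the sequence.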
S3, acquiring an operation code sequence file according to the Android operation code constant list, wherein the operation code sequence is represented by $X = (x_1, x_2, \dots, x_n)$, n is the number of operation codes in the APK, and X is encoded as an $n \times o$ matrix of one-hot vectors ($o = 256$);
S4, converting the one-hot vectors into fixed-size vectors in the embedding layer, so as to improve computational efficiency and make the APK information more concentrated and uniform;
S5, multiplying each converted one-hot vector $x_i$ by a random weight matrix $W_e \in R^{o \times g}$ ($o = 256$, g is the dimension of the embedding space), generating a matrix $M \in R^{n \times g}$, and inputting M to the convolutional neural network; the generation of the convolutional neural network input is shown in FIG. 4.
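Steps S3 to S5 can be sketched in plain Python as follows. The embedding width `g = 8`, the uniform initialisation range, the seed, and the helper names are illustrative assumptions; the text only states that the weight matrix is random:

```python
import random

O = 256  # width of each one-hot vector (o in the text)


def one_hot(opcode_ids, width=O):
    """Encode n opcode values as an n x width matrix with a single 1 per row."""
    rows = []
    for op in opcode_ids:
        row = [0] * width
        row[op] = 1
        rows.append(row)
    return rows


def random_weights(rows, cols, seed=0):
    """Random embedding matrix W_e of shape rows x cols (range and seed assumed)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


opcode_ids = [0x70, 0x12, 0x0E]   # n = 3 opcodes
X = one_hot(opcode_ids)           # shape n x 256
W_e = random_weights(O, 8)        # shape 256 x g, with g = 8 chosen for illustration
M = matmul(X, W_e)                # shape n x g, the CNN input

print(len(M), len(M[0]))
```

Because each row of X contains a single 1, the product simply selects the matching row of `W_e`, which is why embedding layers are usually implemented as a table lookup rather than a full matrix multiplication.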
S6, convolving the matrix M with convolution kernels in the convolutional layer to obtain feature maps, and collecting the feature maps of the different convolution kernels into a feature map set matrix C;
The convolution operation of each convolution kernel is defined as follows:
$c_j = f_1(\mathrm{Conv}(M, w_j) + b_j)$
wherein $b_j$ and $w_j$ are the bias and weight parameters of the $j$-th convolution kernel and $f_1$ is the ReLU activation function. Each convolution kernel performs a complete convolution over the $n - h + 1$ windows, and the feature maps extracted by the different convolution kernels are collected as follows:
$C = [c_1 \mid c_2 \mid \dots \mid c_p]^T$
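The convolution above can be illustrated with a small sketch. It assumes that h is the number of consecutive opcodes covered by a kernel, that each kernel spans the full embedding width g, and that p kernels are used; the tiny matrices and the function names are made up for the example:

```python
def relu(x):
    """ReLU activation f_1."""
    return x if x > 0.0 else 0.0


def conv_feature_map(M, w, b):
    """Slide an h x g kernel w over the n x g matrix M; yields n - h + 1 values."""
    n, h = len(M), len(w)
    out = []
    for start in range(n - h + 1):
        acc = b
        for r in range(h):
            acc += sum(m * k for m, k in zip(M[start + r], w[r]))
        out.append(relu(acc))
    return out


def conv_layer(M, kernels, biases):
    """Stack the feature maps of p kernels into C (p rows, n - h + 1 columns)."""
    return [conv_feature_map(M, w, b) for w, b in zip(kernels, biases)]


M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]   # n = 4, g = 2
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],      # kernel 1, h = 2
    [[-1.0, -1.0], [1.0, 1.0]],    # kernel 2, h = 2
]
biases = [0.0, -0.5]

C = conv_layer(M, kernels, biases)
print(C)  # [[2.0, 1.0, 1.0], [0.0, 0.5, 0.0]]
```

With n = 4 and h = 2, each kernel produces n - h + 1 = 3 values, and ReLU clips the negative responses of the second kernel to zero.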
S7, as shown in FIG. 5, in k-max pooling, performing a max-pooling operation on the matrix C and extracting the k most important feature values to obtain the feature vector Z of the hidden layer, described as follows:
[Equation image GDA0002852854240000051 not reproduced: definition of the k-max pooling output Z]
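Since the pooling formula survives only as an image reference, the following is a sketch of k-max pooling as it is commonly defined: keep the k largest values of each feature map in their original order. Concatenating the pooled rows into a single vector Z is an assumption made for illustration:

```python
def k_max_pooling(feature_map, k):
    """Keep the k largest values of one feature map, preserving their order."""
    by_value = sorted(range(len(feature_map)),
                      key=lambda i: feature_map[i], reverse=True)
    keep = sorted(by_value[:k])            # restore original positions
    return [feature_map[i] for i in keep]


def pool_all(C, k):
    """Apply k-max pooling to every row of C and concatenate the results into Z."""
    z = []
    for row in C:
        z.extend(k_max_pooling(row, k))
    return z


C = [
    [0.2, 0.9, 0.1, 0.7, 0.4],
    [0.5, 0.0, 0.8, 0.3, 0.6],
]
Z = pool_all(C, k=3)
print(Z)  # [0.9, 0.7, 0.4, 0.5, 0.8, 0.6]
```

In the second row the value 0.5 stays ahead of 0.8 and 0.6 even though it is the smallest of the three kept values, which is how the relative position of the features is retained. The length of Z is k times the number of kernels regardless of the input length n.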
S8, forming a fully connected layer from the fixed-size vector Z, and operating on Z in the fully connected layer to obtain an output feature y; this reduces the burden on the model and also helps prevent overfitting. The output feature y is processed with a softmax function, which converts the multi-class output values into relative probabilities, giving a relative probability distribution p;
The output y is computed as follows:
$y = f_2(W_f \cdot Z) + b'$
wherein $W_f$ and $b'$ are the weight matrix and bias of the fully connected layer, and the function $f_2$ is the ReLU activation function.
The softmax function is defined as follows:
$p(l = i \mid y) = \exp(y_i) / \sum_{j=1}^{T} \exp(y_j)$
wherein $y_i$ is the output of the $i$-th neuron of the fully connected layer, T is the number of classes, and $l$ is the label of the data sample, indicating whether the data sample is malicious or benign.
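A short sketch of step S8, following the formula for y as written in the text and using the standard softmax (exponentials in both numerator and denominator). The weights, the input vector, and the function names are illustrative assumptions:

```python
import math


def relu(x):
    """ReLU activation f_2."""
    return x if x > 0.0 else 0.0


def fully_connected(z, W_f, b):
    """y = f_2(W_f . Z) + b', one output per row of W_f."""
    return [relu(sum(w * v for w, v in zip(row, z))) + bias
            for row, bias in zip(W_f, b)]


def softmax(y):
    """Softmax with the maximum subtracted for numerical stability."""
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]


Z = [0.9, 0.7, 0.4]
W_f = [[1.0, 0.0, 0.0],      # T = 2 output neurons
       [0.0, 1.0, 1.0]]
b = [0.0, 0.0]

y = fully_connected(Z, W_f, b)
p = softmax(y)
print(y, p)
```

Subtracting `max(y)` before exponentiating leaves the probabilities unchanged but avoids overflow for large activations.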
S9, calculating a cross-entropy loss function $L_k$ from the obtained probability distribution p and the label probability distribution; the invention uses the cross-entropy loss function because it evaluates the quality of the neural network better than mean squared error.
The loss function $L_k$ is defined as follows:
$L_k = -\sum_{i=1}^{T} l_i \log p(l = i \mid y)$
wherein $L_k$ denotes the loss computed for the $k$-th input sample, and $p(l = i \mid y)$ is the $i$-th value of the softmax output vector, representing the probability that the sample belongs to the $i$-th class. The label $l_i$ is summed over $i$ from 1 to the number of classes T; $T = 2$ is the number of outputs of the fully connected layer.
L represents the average of the loss function over a batch of b data samples, which can be interpreted as the difference between the predicted values obtained by the model and the true values, and is defined as follows:
$L = \frac{1}{b} \sum_{k=1}^{b} L_k$
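The two loss formulas survive only as image references, so this sketch assumes the standard forms that match the verbal description: $L_k = -\sum_{i=1}^{T} l_i \log p(l = i \mid y)$ per sample and $L = \frac{1}{b}\sum_{k=1}^{b} L_k$ per batch. The probabilities, the label ordering, and the `eps` guard are illustrative:

```python
import math


def cross_entropy(p, label):
    """Per-sample loss L_k for a probability vector p and a one-hot label."""
    eps = 1e-12  # guards against log(0); an implementation detail, not from the text
    return -sum(l * math.log(max(q, eps)) for l, q in zip(label, p))


def batch_loss(probs, labels):
    """Mean loss L over a batch of b samples."""
    losses = [cross_entropy(p, l) for p, l in zip(probs, labels)]
    return sum(losses) / len(losses)


probs = [[0.9, 0.1], [0.2, 0.8]]   # softmax outputs for b = 2 samples
labels = [[1, 0], [0, 1]]          # index 0 = benign, 1 = malicious (assumed order)

print(batch_loss(probs, labels))
```

Because the labels are one-hot, each per-sample loss reduces to the negative log of the probability assigned to the true class.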
S10, using the calculated cross-entropy loss $L_k$ to estimate the difference between the model's predicted values and the true values, and optimizing with gradient descent to obtain an optimal Android program detection model;
S11, iteratively updating the model parameters and optimizing the proposed Android program detection model based on the computed outputs and the minimized loss function;
S12, after model training is finished, inputting a new test set to test the model.
In step S3, in the operation code sequence represented by one-hot vectors, each operation code corresponds to one of 256 positions; the position of the operation code that appears is set to 1, and the positions of operation codes that do not appear are set to 0.
In step S7, k is 3 or 5. The effect is best when k is 3 or 5: weakly informative features are eliminated, important features are retained, and overfitting of the model is reduced.
In step S8, the vector Z is operated on in the fully connected layer: the weighted sum of the previous layer's outputs is computed, combining the elements to obtain the output feature y; the softmax function then processes the output feature y, acting as the classifier.
In step S10, the loss function is minimized by adjusting the parameter values of the model step by step with a gradient descent optimization method, so as to obtain an optimal Android program detection model.
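The gradient descent loop of steps S10 and S11 can be sketched on a deliberately reduced model: a single linear layer followed by softmax, trained on made-up two-dimensional feature vectors. This is not the full network of the invention; the learning rate, epoch count, and data are assumptions chosen only to show the update rule:

```python
import math


def softmax(y):
    m = max(y)
    exps = [math.exp(v - m) for v in y]
    s = sum(exps)
    return [e / s for e in exps]


def train(samples, labels, dim, classes=2, lr=0.5, epochs=200):
    """Batch gradient descent on a linear layer followed by softmax."""
    W = [[0.0] * dim for _ in range(classes)]
    b = [0.0] * classes
    history = []                                  # mean loss L per epoch
    n = len(samples)
    for _ in range(epochs):
        gW = [[0.0] * dim for _ in range(classes)]
        gb = [0.0] * classes
        loss = 0.0
        for z, t in zip(samples, labels):
            y = [sum(w * v for w, v in zip(W[c], z)) + b[c] for c in range(classes)]
            p = softmax(y)
            loss -= math.log(p[t])
            for c in range(classes):
                d = p[c] - (1.0 if c == t else 0.0)   # dL_k/dy_c for softmax + cross entropy
                gb[c] += d
                for j in range(dim):
                    gW[c][j] += d * z[j]
        history.append(loss / n)
        for c in range(classes):                  # step against the mean gradient
            b[c] -= lr * gb[c] / n
            for j in range(dim):
                W[c][j] -= lr * gW[c][j] / n
    return W, b, history


samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]                             # 0 = benign, 1 = malicious (assumed)
W, b, history = train(samples, labels, dim=2)
print(history[0], history[-1])
```

The loss starts at log 2 (uniform predictions) and falls as the parameters are updated. In a framework such as TensorFlow, the same update would be applied to every layer's parameters, with the gradients obtained by automatic differentiation.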
The analysis method is implemented in Python; the main libraries used are pandas, sklearn, matplotlib, and numpy. The system development environment is 64-bit Ubuntu 17.10. The proposed neural network detection model was developed using the TensorFlow and Torch framework environments, and a Tesla K80 GPU was used for the experiments and for training the detection method.
The method uses a convolutional neural network to extract static features of Android software effectively, and uses a static analysis tool to extract the operation code sequence from the APK file of the Android software. The k-max pooling strategy applied to the convolutional layer's output captures the relative position information of the operation codes and turns variable-length sequences into fixed-length inputs, which reduces the number of model parameters and helps reduce overfitting. Finally, the classification model is trained with cross-entropy loss, which better evaluates the quality of the neural network. In experiments on different data sets, the method shows certain advantages over other methods in accuracy, precision, and other metrics, and performs well across those data sets.
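For reference, the accuracy and precision metrics mentioned above can be computed as in the sketch below. The label vectors are made up for illustration and are not results from the invention's experiments; treating class 1 as malicious is an assumption:

```python
def accuracy_and_precision(y_true, y_pred, positive=1):
    """Accuracy over all samples and precision for the positive (malicious) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    return accuracy, precision


y_true = [1, 1, 0, 0, 1, 0]   # illustrative ground-truth labels
y_pred = [1, 0, 0, 1, 1, 0]   # illustrative model predictions

acc, prec = accuracy_and_precision(y_true, y_pred)
print(acc, prec)
```

Here four of six predictions are correct, and two of the three samples flagged as malicious are truly malicious.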
The above-mentioned embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solution of the present invention by those skilled in the art should fall within the protection scope defined by the claims of the present invention without departing from the spirit of the present invention.

Claims (5)

1. A malicious software operation code analysis method based on a convolutional neural network, characterized by comprising the following steps:
S1, obtaining training samples; each training sample is an executable of software of a known type, the types being benign and malicious;
S2, decompiling the training sample to obtain its smali files, and acquiring Dalvik bytecode from the smali files;
S3, acquiring an operation code sequence file according to the Android operation code constant list, and representing the operation code sequence with one-hot vectors;
S4, converting the one-hot vectors into fixed-size vectors in the embedding layer;
S5, multiplying the converted one-hot vectors by a random weight matrix to generate a matrix M, and inputting the matrix M into the convolutional neural network;
S6, convolving the matrix M with convolution kernels in the convolutional layer to obtain feature maps, and collecting the feature maps of the different convolution kernels into a feature map set matrix C;
the convolution operation of each convolution kernel is defined as follows:
$c_j = f_1(\mathrm{Conv}(M, w_j) + b_j)$
wherein $b_j$ and $w_j$ are the bias and weight parameters of the $j$-th convolution kernel and $f_1$ is the ReLU activation function; each convolution kernel performs a complete convolution over the $n - h + 1$ windows, and the feature maps extracted by the different convolution kernels are collected as follows:
$C = [c_1 \mid c_2 \mid \dots \mid c_p]^T$
S7, in k-max pooling, performing a max-pooling operation on the matrix C and extracting the k most important feature values to obtain the feature vector Z of the hidden layer;
S8, forming a fully connected layer from the fixed-size vector Z, and operating on Z in the fully connected layer to obtain an output feature y; processing y with a softmax function to convert the multi-class output values into relative probabilities, obtaining a relative probability distribution p;
S9, calculating a cross-entropy loss function $L_k$ from the obtained probability distribution p and the label probability distribution;
the loss function $L_k$ is defined as follows:
$L_k = -\sum_{i=1}^{T} l_i \log p(l = i \mid y)$
wherein $L_k$ denotes the loss computed for the $k$-th input sample, $p(l = i \mid y)$ is the $i$-th value of the softmax output vector and represents the probability that the sample belongs to the $i$-th class, the label $l_i$ is summed over $i$ from 1 to the number of classes T, and $T = 2$ is the number of outputs of the fully connected layer;
L represents the average of the loss function over a batch of b data samples, which can be interpreted as the difference between the predicted values obtained by the model and the true values, and is defined as follows:
$L = \frac{1}{b} \sum_{k=1}^{b} L_k$
S10, using the calculated cross-entropy loss $L_k$ to estimate the difference between the model's predicted values and the true values, and optimizing with gradient descent to obtain an optimal Android program detection model;
S11, iteratively updating the model parameters and optimizing the Android program detection model based on the computed outputs and the minimized loss function.
2. The convolutional neural network-based malware operation code analysis method of claim 1, wherein: in step S3, each operation code in the operation code sequence represented by one-hot vectors corresponds to one of 256 positions; the position of the operation code that appears is set to 1, and the positions of operation codes that do not appear are set to 0.
3. The convolutional neural network-based malware operation code analysis method of claim 1, wherein: in step S7, k is 3 or 5.
4. The convolutional neural network-based malware operation code analysis method of claim 1, wherein: in step S8, the vector Z is operated on in the fully connected layer: the weighted sum of the previous layer's outputs is computed, combining the elements to obtain the output feature y; the softmax function then processes the output feature y, acting as the classifier.
5. The convolutional neural network-based malware operation code analysis method of claim 1, wherein: in step S10, the loss function is minimized by adjusting the parameter values of the model step by step with a gradient descent optimization method, so as to obtain an optimal Android program detection model.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910776705.5A CN110472417B (en) 2019-08-22 2019-08-22 Convolutional neural network-based malicious software operation code analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910776705.5A CN110472417B (en) 2019-08-22 2019-08-22 Convolutional neural network-based malicious software operation code analysis method

Publications (2)

Publication Number Publication Date
CN110472417A (en) 2019-11-19
CN110472417B (en) 2021-03-30

Family

ID=68512716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910776705.5A Active CN110472417B (en) 2019-08-22 2019-08-22 Convolutional neural network-based malicious software operation code analysis method

Country Status (1)

Country Link
CN (1) CN110472417B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11550911B2 (en) 2020-01-31 2023-01-10 Palo Alto Networks, Inc. Multi-representational learning models for static analysis of source code
US11615184B2 (en) * 2020-01-31 2023-03-28 Palo Alto Networks, Inc. Building multi-representational learning models for static analysis of source code
CN111444507B (en) * 2020-06-15 2020-11-03 Peng Cheng Laboratory (鹏城实验室) Method, device, equipment and storage medium for judging whether packed software is falsely reported
CN113378171B (en) * 2021-07-12 2022-06-21 Northeastern University Qinhuangdao Branch (东北大学秦皇岛分校) Android ransomware detection method based on convolutional neural network
CN117077141A (en) * 2023-10-13 2023-11-17 State Grid Shandong Electric Power Company Yutai County Power Supply Company (国网山东省电力公司鱼台县供电公司) Smart power grid malicious software detection method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9721097B1 (en) * 2016-07-21 2017-08-01 Cylance Inc. Neural attention mechanisms for malware analysis
CN108985055A (en) * 2018-06-26 2018-12-11 Northeastern University Qinhuangdao Branch (东北大学秦皇岛分校) Malware detection method and system
CN109002715A (en) * 2018-07-05 2018-12-14 Northeastern University Qinhuangdao Branch (东北大学秦皇岛分校) Malware recognition method and system based on convolutional neural networks

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
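Deep Android Malware Detection; Niall McLaughlin et al.; published online: https://adamdoupe.com/publications/deep-android-malware-detection-codaspy2017.pdf; 2017-12-31; entire document *
Android malicious application detection algorithm based on CNN and naive Bayes (基于CNN和朴素贝叶斯方法的安卓恶意应用检测算法); 李创丰 et al.; Journal of Information Security Research (信息安全研究); 2019-06-05; vol. 5, no. 6; pp. 470-476 *
Improved convolutional neural network algorithm based on LeNet-5 (基于Le net一5的卷积神经网络改进算法); 李丹 et al.; Computer Era (计算机时代); 2016-10-09; no. 8; pp. 4-12 *
Android malicious application detection method based on convolutional neural networks (基于卷积神经网络的Android恶意应用检测方法); 郗桐 et al.; Journal of Information Security Research (信息安全研究); 2018-09-07; vol. 4, no. 8; pp. 715-721 *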

Also Published As

Publication number Publication date
CN110472417A (en) 2019-11-19

Similar Documents

Publication Publication Date Title
CN110472417B (en) Convolutional neural network-based malicious software operation code analysis method
CN110348214B (en) Method and system for detecting malicious codes
CN111368920A (en) Quantum twin neural network-based binary classification method and face recognition method thereof
JP5207870B2 (en) Dimension reduction method, pattern recognition dictionary generation device, and pattern recognition device
CN111340132B (en) Machine olfaction mode identification method based on DA-SVM
CN109033833B (en) Malicious code classification method based on multiple features and feature selection
Mostavi et al. Deep-2'-O-me: predicting 2'-O-methylation sites by convolutional neural networks
CN111428557A (en) Method and device for automatically checking handwritten signature based on neural network model
CN113806746A (en) Malicious code detection method based on improved CNN network
CN112364974B (en) YOLOv3 algorithm based on activation function improvement
CN116308754B (en) Bank credit risk early warning system and method thereof
CN113434699A (en) Pre-training method of BERT model, computer device and storage medium
CN109324595B (en) Industrial monitoring data classification method based on incremental PCA
CN111224998A (en) Botnet identification method based on extreme learning machine
WO2022162427A1 (en) Annotation-efficient image anomaly detection
CN117078007A (en) Multi-scale wind control system integrating scale labels and method thereof
CN107943916B (en) Webpage anomaly detection method based on online classification
CN113139368B (en) Text editing method and system
CN112784927B (en) Semi-automatic image labeling method based on online learning
CN111079143B (en) Trojan horse detection method based on multi-dimensional feature map
CN115292702A (en) Malicious code family identification method, device, equipment and storage medium
CN115512144A (en) Automatic XRF spectrogram classification method based on convolution self-encoder
CN114187966A (en) Single-cell RNA sequence missing value filling method based on generation countermeasure network
CN114579763A (en) Character-level confrontation sample generation method for Chinese text classification task
CN107292280A (en) A kind of seal automatic font identification method and identifying device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant