CN116821905A

CN116821905A - Knowledge search-based malicious software detection method and system

Info

Publication number: CN116821905A
Application number: CN202310639896.7A
Authority: CN
Inventors: 朱会娟; 夏梦珍; 王良民; 马润泽; 徐志城; 陈磊; 朱海宇
Original assignee: Jiangsu University
Current assignee: Jiangsu University
Priority date: 2023-06-01
Filing date: 2023-06-01
Publication date: 2023-09-29

Abstract

The invention discloses a method and a system for detecting malicious software based on knowledge search. The method builds its model training on a knowledge distillation algorithm and a neural network structure search technique so as to improve the existing deep learning training mode. First, the required feature files are extracted from an Android software sample, the required feature types are extracted from the feature files, and the features are screened to form a feature data set that represents the Android software sample. Second, a multi-layer perceptron network is used, since it can effectively capture the correlation and dependence among the original features and has good classification performance. Knowledge distillation and neural network structure search are introduced: a teacher network model participates in the training process of a student network model, and the student network model with the strongest learning ability is adaptively searched out. This narrows the performance gap between the teacher network and the student network model and provides a feasible scheme for deep-learning-based malware detection.

Description

Knowledge search-based malicious software detection method and system

Technical Field

The invention relates to a malicious behavior detection technology, in particular to a method and a system for detecting malicious software based on knowledge search.

Background

Knowledge distillation is an important branch of model compression. It was first proposed by Hinton et al. in 2015 and has attracted extensive attention and research in academia and industry. Knowledge distillation is a technique for reducing model size and accelerating model inference without degrading model accuracy. The core idea is to compress the knowledge of a complex model (the teacher model) into a simple model (the student model), so that the student model is smaller and faster at inference while maintaining high accuracy. Knowledge distillation is widely applied in natural language processing, computer vision, speech recognition and other fields, where it reduces the size and computational cost of a model while maintaining its accuracy. In natural language processing, for example, knowledge distillation is used to transfer the knowledge of the BERT model into small models.

The multi-layer perceptron (Multilayer Perceptron, MLP) is an artificial neural network (Artificial Neural Network, ANN), which is one of representative algorithms of Deep Learning (Deep Learning), and is composed of an input layer, a hidden layer and an output layer, and the layers are fully connected. The multi-layer perceptron network has advantages in self-learning and self-adaption, has good expandability and universality, can handle nonlinear problems, and is widely applied to tasks such as pattern recognition, classification, regression and the like.

The neural network structure search technique (Neural Architecture Search, NAS) originates from the work of Zoph et al. in 2016, which used reinforcement learning to search network architectures for image recognition and language modeling and found that the searched networks outperformed manually designed ones. In essence, NAS turns the manual tuning of a neural network into an automated task of finding the best network structure: a search strategy tests and evaluates a large number of architectures in the search space, and the architecture that best meets the objective of the given problem is selected by maximizing a fitness function.

Among existing malware detection and recognition methods, CN115879109A discloses a malware recognition method based on a Vision Transformer: a malware image data set is constructed by acquiring the ImageNet-21K image data set and visualizing samples of an application executable file set as RGB images, and a knowledge distillation model is built from a Vision Transformer model with an X-layer encoder to judge whether unknown software is benign or malicious and to determine the family label of the malware. CN116070209A discloses an Android malware detection method based on cost-sensitive learning: original feature vectors are constructed by extracting permissions and the four major components to represent samples, and a sample-sensitive weight sequence obtained by a sample-sensitive weight calculation method is used for feature selection and model training; this addresses the high false alarm rate on malware and the insufficient generalization ability of Android malware detection models caused by unbalanced data sets, and improves the efficiency and accuracy of the detection model.

For another example, in the technical scheme of CN111611377B, BERT is trained as the teacher model through knowledge distillation, and a multi-layer bidirectional long short-term memory (LSTM) network is then trained as the student model. While the student model is trained, it learns the teacher model's losses at the embedding layer, hidden layers and prediction layer, with the different spatial representations aligned through linear transformation, and a trained miniature student model is finally obtained; the calculation cost of this scheme is high. In the technical scheme of CN112367473A, a teacher network model is trained first, a student model is then built, and the training of the student model is assisted by knowledge distillation; this scheme not only has a high calculation cost but also requires the original data packets to be preprocessed first.

Disclosure of Invention

Purpose of the invention: to overcome the defects of the prior art, the invention provides a malware detection method based on knowledge search. Building on the self-learning and self-adapting capability of the multi-layer perceptron network, it integrates the advantages of knowledge distillation and neural network structure search, extracts highly representative features of the software, improves training effect and efficiency, achieves high detection performance for the student model, improves the generalization ability of the model, and greatly reduces calculation cost.

The technical scheme is as follows: the invention discloses a malicious software detection method based on knowledge search, which comprises the following steps:

step 1, identifying the file type of an application program file to be tested by judging whether the input application program file to be tested is a compressed file or not, namely judging whether the application program file to be tested is android software or not;

step 2, if the input sample is android software, decompressing an installation file of the android software, extracting API features, permission features and export component features from the decompressed apk file to form a feature set, and repeating the steps to obtain a feature set of a large number of apk files so as to construct a corresponding training set and a corresponding test set;

step 3, screening the features in the feature set participating in training through a decision tree algorithm, extracting high-representative features, and improving the characterization capability of the feature data set;

step 4, constructing a teacher network model and a student network model based on the knowledge distillation network of the multi-layer perceptron, and simultaneously training the teacher network and the student network by using the training set sample constructed in the step 2, wherein the teacher network participates in training the student network in the training process;

step 5, introducing a neural network structure search technology into the knowledge distillation network, adaptively searching the student network with the strongest learning ability, and improving the robustness and generalization ability of the model; the specific method comprises the following steps:

step 5.1, loading the teacher network model and the student network model stored in the step 4;

step 5.2, each layer of the teacher network and the student network uses a ReLU activation function, and the final classifier is a combination of a linear layer and a softmax activation function; because the task is to identify whether the file to be tested is malicious or benign, which is a binary classification problem, the number of output branches is set to 2 and the probabilities of the two results are output;

step 5.3, in a group of parallel student networks, the neural network structure searching technology adaptively searches out a student network model with the strongest learning ability, and the loss function of the searching process is defined as follows:

first, calculating an error generated by the output of a teacher network model:

secondly, calculating errors generated by the participation of a teacher network in guiding the output of a student network model:

finally, the total error is calculated:

$$\mathrm{Loss}_{total} = \mathrm{Loss}_{teacher} + \mathrm{Loss}_{student}$$

wherein S and T denote the student network and the teacher network respectively, and the teacher and student outputs entering the error terms are the normalized outputs of the softmax function;

step 6, applying the model based on the knowledge distillation and neural network structure search technology trained in the step 5 to detection of android software to be detected.
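By way of illustration only, the search loss of step 5.3 may be sketched in Python as follows. The extracted text does not reproduce the formulas of the two error terms, so their forms here are assumptions: the teacher error is taken as the cross-entropy of the teacher output against the label, and the student error as the cross-entropy of the student output plus the mean square error between the student and teacher softmax outputs. The names `total_loss` and `pick_best_student`, the logits, and the rule of choosing the candidate with the lowest loss on one sample are hypothetical stand-ins for the actual search strategy.

```python
import math


def softmax(logits):
    """Normalize raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross-entropy of a predicted distribution against an integer class label."""
    return -math.log(max(probs[label], 1e-12))


def mse(p, q):
    """Mean squared error between two probability vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def total_loss(teacher_logits, student_logits, label):
    """Loss_total = Loss_teacher + Loss_student; both term definitions are assumed."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    loss_teacher = cross_entropy(p_t, label)                   # assumed form
    loss_student = cross_entropy(p_s, label) + mse(p_s, p_t)   # assumed form
    return loss_teacher + loss_student


def pick_best_student(teacher_logits, candidates, label):
    """Return the name of the parallel student with the lowest total loss."""
    return min(candidates,
               key=lambda name: total_loss(teacher_logits, candidates[name], label))


# Hypothetical logits for one malicious sample (class index 1).
teacher = [-1.0, 2.0]
students = {"student_a": [0.5, 0.4], "student_b": [-0.8, 1.5]}
best = pick_best_student(teacher, students, label=1)
```

In a full implementation the loss would be accumulated over the training set and the candidate students would differ in architecture; the sketch only shows how the summed loss can rank parallel candidates.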

Further, the specific step of extracting the apk file features in the step 2 is as follows:

step 2.1, acquiring API features, permission features and export component features of an apk file, extracting the API and permission features by using an android tool, and extracting the export component features, namely activities, broadcast receivers, content providers and services by using a Drozer tool;

step 2.2, coarsening all APIs used in the dex file into key API classes, synthesizing feature vectors, and using the number of calls of each API as the feature value to form an API feature set; the extracted permission features, covering official permissions and custom permissions, adopt a classical binary representation to form a permission feature set; and the exported-component features are extracted by testing for the four major components and adopt a classical binary representation to form an exported-component feature set.

Repeating steps 2.1 to 2.2 to obtain three feature sets of a large batch of apk files and constructing a training data set and a test data set of the model; and respectively carrying out decision tree screening on the API feature set, the permission feature set and the derived component feature set in the training data set, and combining the three screened feature sets to form a final data set containing three features.
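The encoding of steps 2.1 and 2.2 can be sketched as follows, assuming the API calls, permissions and exported components of one apk file have already been extracted by external tools. The helper names, the `Class;->method` call format, and the small vocabularies are hypothetical and chosen only for illustration.

```python
from collections import Counter


def api_count_vector(api_calls, key_classes):
    """Coarsen each call to its class and count calls per key API class."""
    counts = Counter(call.split("->")[0] for call in api_calls)
    return [counts.get(cls, 0) for cls in key_classes]


def binary_vector(present, vocabulary):
    """Classical binary representation: 1 if the item occurs in the sample, else 0."""
    return [1 if item in present else 0 for item in vocabulary]


# Hypothetical vocabularies shared by all samples in the data set.
key_api_classes = ["Landroid/telephony/SmsManager;", "Ljava/net/URL;"]
permission_vocab = ["android.permission.SEND_SMS",
                    "android.permission.WRITE_CONTACTS"]
component_vocab = ["activity", "receiver", "provider", "service"]

# Hypothetical extraction results for one apk file.
api_calls = ["Landroid/telephony/SmsManager;->sendTextMessage",
             "Landroid/telephony/SmsManager;->getDefault",
             "Ljava/io/File;->delete"]
permissions = {"android.permission.SEND_SMS"}
exported_components = {"receiver", "service"}

# Concatenate the three feature sets into one sample vector.
sample_vector = (api_count_vector(api_calls, key_api_classes)
                 + binary_vector(permissions, permission_vocab)
                 + binary_vector(exported_components, component_vocab))
```

API features carry call counts while permission and component features are 0/1 flags, which matches the two representations described in step 2.2.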

Further, the detailed process of the step 3 is as follows:

step 3.1, inputting an original characteristic data set, namely inputting the characteristic data set which is finally obtained in the step 2.2 and contains three characteristics;

step 3.2, processing the high-dimensional data with the decision tree algorithm, using the following formulas:

first, the probability distribution of the feature is calculated:

$$P(X = x_i) = p_i, \quad i = 1, 2, 3, \ldots, n$$

wherein $x_i$ refers to a feature, for example WRITE_CONTACTS or HARDWARE_TEST;

the empirical entropy of the features is then calculated:

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$

wherein $0 \le H(X) \le \log n$ and n represents the number of categories of data (i.e., whether malware or benign software is ultimately detected);

after completion, the conditional entropy of the feature is calculated:

$$P(X = x_i, Y = y_j) = p_{ij}, \quad i = 1, 2, 3, \ldots, n, \quad j = 1, 2, 3, \ldots, n$$

$$H(Y \mid X) = \sum_{i=1}^{n} p_i \, H(Y \mid X = x_i)$$

wherein X is a feature (a specific feature in the API, permission or exported-component data set), Y is a category (malicious or benign), $P(X = x_i, Y = y_j)$ is the joint probability distribution, and $H(Y \mid X)$ is the conditional entropy;

step 3.3, selecting the optimal feature as the node-splitting feature according to the information gain of each feature, calculated as:

$$IG(D, A) = H(D) - H(D \mid A)$$

that is, the information gain of feature A on the training data set D is the difference between the empirical entropy $H(D)$ of the set D and the empirical conditional entropy $H(D \mid A)$ of D given feature A.
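As a worked illustration of step 3, the information gain of a single feature can be computed as in the following sketch. The function names and the six-sample toy data are hypothetical, and the base-2 logarithm is an assumption, since the text does not state the base.

```python
import math
from collections import Counter


def entropy(labels):
    """Empirical entropy H(D) of a list of class labels (base-2 logarithm assumed)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(feature, labels):
    """Empirical conditional entropy H(D|A): entropy within each feature value, weighted by frequency."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def information_gain(feature, labels):
    """IG(D, A) = H(D) - H(D|A)."""
    return entropy(labels) - conditional_entropy(feature, labels)


# Toy data: 1 = malicious, 0 = benign; feature columns are binary permission flags.
labels = [1, 1, 1, 0, 0, 0]
send_sms = [1, 1, 1, 0, 0, 0]   # perfectly separates the two classes
internet = [1, 0, 1, 0, 1, 0]   # nearly uninformative

gains = {"SEND_SMS": information_gain(send_sms, labels),
         "INTERNET": information_gain(internet, labels)}
ranked = sorted(gains, key=gains.get, reverse=True)
```

Features would then be kept in order of decreasing gain; how many are retained is not fixed by the text.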

Further, in the step 4, a teacher network model is built based on the multi-layer perceptron MLP, and the teacher network model comprises an input layer, 3 hidden layers and an output layer; constructing a student network model based on a multi-layer perceptron MLP, wherein the student network model comprises an input layer, 1 hidden layer and an output layer; each layer of the teacher network and the student network uses a ReLU activation function, and the final classifier is a combination of a linear layer and a softmax activation function; the parameters for constructing the teacher network model and the student network model are as follows:

activation function: $\mathrm{ReLU} = \max(0, w^{T}x + b)$

wherein $w^{T}$ is the transpose of the weight matrix between adjacent layers, x is the input vector, and b is the bias between the layers; classification function:

$$\mathrm{softmax}(z_k) = \frac{e^{z_k}}{\sum_{c=1}^{C} e^{z_c}}$$

wherein $z_k$ is the output value of the k-th node and C is the number of output nodes, i.e. the number of classes; loss function:

wherein $L_{CE}$ is the cross-entropy loss function, $L_{MSE}$ is the mean-square-error loss function, Y is the true label value, and $\hat{Y}$ is the predicted probability value.
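The two network shapes of step 4 can be sketched as follows: a teacher with 3 hidden layers and a student with 1 hidden layer, each hidden layer followed by ReLU and the final linear layer followed by softmax. The layer widths (8 inputs, 16 hidden units), the random initialization and the helper names are hypothetical; the text fixes only the number of hidden layers and the 2-way output.

```python
import math
import random


def relu(v):
    """ReLU = max(0, x), applied element-wise."""
    return [max(0.0, x) for x in v]


def softmax(z):
    """softmax(z_k) = exp(z_k) / sum_c exp(z_c), computed stably."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def linear(weights, bias, x):
    """Compute W x + b for one layer; weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def build_mlp(sizes, rng):
    """Create (weights, bias) pairs for consecutive layer sizes."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((w, [0.0] * n_out))
    return layers


def forward(layers, x):
    """Hidden layers use ReLU; the final linear layer is followed by softmax."""
    for w, b in layers[:-1]:
        x = relu(linear(w, b, x))
    w, b = layers[-1]
    return softmax(linear(w, b, x))


def count_parameters(layers):
    """Total number of weights and biases in the network."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


rng = random.Random(0)
teacher = build_mlp([8, 16, 16, 16, 2], rng)   # input, 3 hidden layers, output
student = build_mlp([8, 16, 2], rng)           # input, 1 hidden layer, output
probs = forward(student, [2, 0, 1, 0, 0, 1, 0, 1])
```

The two entries of `probs` are the benign and malicious probabilities; `count_parameters` shows how much smaller the student is than the teacher for the chosen widths.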

The invention also discloses a malicious software detection system based on knowledge search, which comprises a feature extraction module, a feature data set construction module and a knowledge search module;

the feature extraction module is used for extracting feature files in the files to be detected to obtain API, permission and derived component features of the files to be detected;

the feature data set construction module uses a decision tree algorithm, and uses information gain in high-dimensional features to screen out more representative features, so that the characterization capability of the features on a sample is improved, and the detection utility is improved;

the knowledge search module combines knowledge distillation and neural network structure search to construct a teacher network model and a student network model and trains them at the same time; within a group of parallel student network models, the neural network structure search technology adaptively searches out the student network model with the strongest learning ability as the final student network model, and the final student network model classifies the software file to be detected as malicious software or benign software; the knowledge search module applies a trainable end-to-end structure and trains the networks simultaneously so that the model obtains better underlying representation capability; the numbers of teacher and student networks can be set flexibly, which improves the detection capability and generalization capability of the model.

The beneficial effects are that: the method can effectively detect Android samples, narrows the performance gap left by knowledge distillation, and offers a solution to the limitation that current deep-learning-based models, with their complex structure and high computing demands, are difficult to deploy in resource-constrained practical environments; compared with the prior art, the invention has the following advantages:

(1) The invention characterizes software by extracting API, permission and exported-component features; this multi-type feature representation can effectively cope with the resistance that obfuscation and packing techniques bring to the detection process.

(2) The invention uses decision tree algorithm to screen out the characteristic with high representativeness from the high-dimensional characteristic, thereby improving the characteristic characterization capability and enhancing the detection benefit of the model.

(3) The invention uses the thought of knowledge distillation, trains the teacher model and the student model simultaneously, and the teacher model participates in guiding the student model in the training process, so that the student model has better bottom layer expression capability, reduces calculation cost and can ensure high accuracy.

(4) The invention integrates the neural network structure searching technology, and self-adaptively searches the network with the strongest learning ability from the parallel training network to be used as the final student network model, thereby realizing the flexible setting of the teacher and the student network model and improving the model detection performance.

Drawings

FIG. 1 is a flow chart of extracting features of a file to be tested according to the present invention;

FIG. 2 is a flow chart of the decision tree algorithm used in the present invention;

FIG. 3 is a flow chart of the present invention for training a teacher model and a student model and adaptively searching;

FIG. 4 is a schematic diagram of the structure of a teacher network and a student network according to the present invention.

Detailed Description

The technical scheme of the present invention is described in detail below, but the scope of the present invention is not limited to the embodiments.

The invention discloses a malicious software detection method based on knowledge search, which comprises the following steps:

step 1, judging whether the input application program file is android software or not by judging the type of the input application program file;

step 1.1, acquiring API features, permission features and export component features of an apk file, extracting the API and permission features by using an android tool, and extracting the features of the four major exported components, namely activities, broadcast receivers, content providers and services, by using the Drozer tool;

step 1.2, coarsening all APIs used in the dex file into key API classes, synthesizing feature vectors, and using the number of calls of each API as the feature value; the extracted permission features, covering official permissions and custom permissions, adopt a classical binary representation to form a feature set; the exported-component features are extracted by testing for the four major components and adopt a classical binary representation to form a feature data set;

step 2, if the application program is android software, decompressing an installation file of the android software, extracting API features and permission features from the decompressed apk file to form a feature set, and repeating the steps to obtain a feature set of a large number of apk files as shown in fig. 1 so as to construct a training set and a testing set for use by a subsequent model;

step 3, as shown in fig. 2, the features in the training set are screened through a decision tree algorithm, and the specific method is as follows:

step 3.1, inputting the original feature data set;

first, the probability distribution of the feature is calculated:

$$P(X = x_i) = p_i, \quad i = 1, 2, 3, \ldots, n$$

the empirical entropy of the features is then calculated:

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$

wherein $0 \le H(X) \le \log n$ and n represents the number of categories of data;

after completion, the conditional entropy of the feature is calculated:

$$P(X = x_i, Y = y_j) = p_{ij}, \quad i = 1, 2, 3, \ldots, n, \quad j = 1, 2, 3, \ldots, n$$

$$H(Y \mid X) = \sum_{i=1}^{n} p_i \, H(Y \mid X = x_i)$$

wherein X is a feature, Y is a category, $P(X = x_i, Y = y_j)$ is the joint probability distribution, and $H(Y \mid X)$ is the conditional entropy;

the information gain of each feature is then calculated:

$$IG(D, A) = H(D) - H(D \mid A)$$

that is, the information gain of feature A on the training data set D is the difference between the empirical entropy $H(D)$ of the set D and the empirical conditional entropy $H(D \mid A)$ of D given feature A;

step 4, as shown in FIG. 3 and FIG. 4, the model is trained with the samples on a Windows platform: a teacher network model and a student network model are constructed based on the knowledge distillation network of the multi-layer perceptron, the teacher network and the student network are trained simultaneously, and the teacher network participates in training the student network.

The teacher network model comprises an input layer, 3 hidden layers and an output layer; the student network model comprises an input layer, 1 hidden layer and an output layer; each layer of the teacher and student networks uses a ReLU activation function and the final classifier is a combination of a linear layer and a softmax activation function.

The parameters of the teacher network model and the student network model built based on the multilayer perceptron mechanism are as follows:

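activation function: $\mathrm{ReLU} = \max(0, w^{T}x + b)$

wherein $w^{T}$ is the transpose of the weight matrix between adjacent layers, x is the input vector, and b is the bias between the layers; classification function:

$$\mathrm{softmax}(z_k) = \frac{e^{z_k}}{\sum_{c=1}^{C} e^{z_c}}$$

wherein $z_k$ is the output value of the k-th node and C is the number of output nodes, i.e. the number of classes.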

Step 5, introducing a neural network structure search technology into a knowledge distillation network, and adaptively searching a student network with the strongest learning ability, wherein the specific method comprises the following steps:

step 5.2, each layer of the teacher network and the student network uses a ReLU activation function, and the final classifier is a combination of a linear layer and a softmax activation function;

first, calculating an error generated by the output of a teacher network model:

finally, the total error is calculated:

$$\mathrm{Loss}_{total} = \mathrm{Loss}_{teacher} + \mathrm{Loss}_{student}$$

step 6, applying the model based on the knowledge distillation and neural network structure search technology trained in the step 5 to detection of android software to be detected, and outputting whether the software is benign or malicious.
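The file-type check of step 1 in the list above can be sketched with the Python standard library. An apk file is a ZIP archive, so the sketch first tests for a ZIP container and then looks for two entries that are typically present in an apk. Treating `classes.dex` as mandatory is an assumption of this sketch, and the names `looks_like_apk` and `make_archive` are hypothetical.

```python
import io
import zipfile


def looks_like_apk(source):
    """Return True if source (a path or binary file object) is a ZIP archive
    containing AndroidManifest.xml and classes.dex."""
    if not zipfile.is_zipfile(source):
        return False
    with zipfile.ZipFile(source) as archive:
        names = set(archive.namelist())
    return "AndroidManifest.xml" in names and "classes.dex" in names


def make_archive(entry_names):
    """Build an in-memory ZIP with empty entries, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in entry_names:
            archive.writestr(name, b"")
    buffer.seek(0)
    return buffer


apk_like = make_archive(["AndroidManifest.xml", "classes.dex", "resources.arsc"])
plain_zip = make_archive(["readme.txt"])
not_a_zip = io.BytesIO(b"plain text, not an archive at all")

results = [looks_like_apk(apk_like),
           looks_like_apk(plain_zip),
           looks_like_apk(not_a_zip)]
```

Only inputs that pass this check would proceed to decompression and feature extraction in step 2.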

According to this technical scheme, the teacher model and the student model are trained simultaneously. On one hand, the teacher network model passes its own experience to the student network model as prior knowledge, so that the student network model inherits the experience of its predecessor; on the other hand, the student network model accumulates its own experience during training. The student network model thus obtains better underlying representation capability, and effective detection of malicious software can be achieved. In addition, the method uses the information gain method of the decision tree algorithm to screen out features with high information gain, which are more representative for characterizing malicious software, thereby improving detection utility.

Example 1:

To verify the performance of the technical scheme, training and testing are carried out with Android malicious software and Android benign software from the same data set. The results show that the student network model achieves an accuracy (Acc) of 96.65%, a precision (Pre) of 97.89%, an F1-score (F1) of 96.57%, an AUC of 96.75% and an FPR of 2.82%; the teacher network model achieves an accuracy (Acc) of 96.63%, a precision (Pre) of 97.84%, an F1-score (F1) of 96.56%, an AUC of 96.70% and an FPR of 2.89%. Meanwhile, the network structure is reduced by 50% and the number of parameters is reduced by 34%. The method provided by the invention therefore compresses the model while keeping good detection performance and reducing calculation cost.
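For reference, the reported metrics other than AUC follow from the confusion matrix as in the sketch below, with malicious software taken as the positive class. The counts are made up for illustration and are not the data behind the figures above; the function name is hypothetical.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute Acc, Pre, recall, F1 and FPR from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)   # share of benign samples flagged as malicious
    return {"Acc": accuracy, "Pre": precision, "Recall": recall,
            "F1": f1, "FPR": fpr}


# Hypothetical counts: 100 malicious and 100 benign test samples.
metrics = detection_metrics(tp=95, fp=3, tn=97, fn=5)
```

AUC is omitted because it requires the ranked prediction scores rather than a single confusion matrix.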

Example 2:

This example compares the solution of the present invention with three prior-art techniques: PARF by Chan P. et al., DroidSieve by Suarez-Tangil et al., and XMAL by Wu B. et al.

The PARF technique proposed by Chan P. et al. extracts the official permissions and API features of each application and filters out the more representative features using information gain (IG). The comparison uses the random forest (RF) classifier that performed best in that paper.

DroidSieve, the study by Suarez-Tangil et al., extracts API, permission, component and statistical features, ranks the extracted features by the Mean Decrease Impurity (MDI) method, and detects malware with extremely randomized trees (Extra Trees) using the top-ranked API, permission, component and Stat (cert_diff.1) features.

For the study of Wu B. et al., the latest source code of XMAL is used to obtain a feature data set of dimension 154 (comprising 94 API features and 60 permission features) for malware detection.

The experimental comparison results are shown in table 1.

TABLE 1

In summary, the invention introduces neural network structure search into the knowledge distillation model, so that the final student network model with the strongest learning ability is adaptively searched out and the detection performance of the lightweight model is improved as far as possible; it extracts API, permission and exported-component features to characterize malicious software from multiple angles and at multiple levels, which overcomes the one-sidedness of single-type feature representation and effectively mitigates the interference of adversarial and obfuscation techniques; and it applies the information gain idea of the decision tree algorithm to the original features to select the features that best represent the samples, thereby improving detection utility.

Claims

1. The method for detecting the malicious software based on the knowledge search is characterized by comprising the following steps of:

step 1, identifying the file type of an application program file to be tested by judging whether the input application program file to be tested is a compressed file, namely judging whether the software to be tested is android software;

step 2, if the input sample is android software, decompressing an installation file of the android software, extracting API features, permission features and export component features from the decompressed apk file to form a corresponding feature set, and repeating the steps to obtain a feature set of a large number of apk files so as to construct a corresponding training set and a test set;

step 3, screening the features in the feature set participating in training through a decision tree algorithm;

step 4, constructing a teacher network model and a student network model based on the knowledge distillation network of the multi-layer perceptron, and simultaneously training the teacher network and the student network by using the training set sample constructed in the step 3, wherein the teacher network participates in training the student network in the training process;

first, calculating an error generated by the output of a teacher network model:

finally, the total error is calculated:

$$\mathrm{Loss}_{total} = \mathrm{Loss}_{teacher} + \mathrm{Loss}_{student}$$

2. The knowledge search-based malware detection method according to claim 1, wherein the specific steps of extracting apk file features in step 2 are:

step 2.1, acquiring API features, permission features and export component features of an apk file, extracting the API and permission features by using an android tool, and extracting the features of the four major exported components, namely activities, broadcast receivers, content providers and services, by using the Drozer tool;

step 2.2, coarsening all APIs used in the dex file into key API classes, synthesizing feature vectors, and using the number of calls of each API as the feature value to form an API feature set; the extracted permission features, covering official permissions and custom permissions, adopt a classical binary representation to form a permission feature set; the exported-component features are extracted by testing for the four major components and adopt a classical binary representation to form an exported-component feature set;

repeating step 2.1 and step 2.2 to obtain three feature sets of a large batch of apk files and constructing a training data set and a test data set of the model; decision tree screening is carried out separately on the API feature set, the permission feature set and the exported-component feature set in the training data set, and the three screened feature sets are combined to form a final data set containing three kinds of features.

3. The method for detecting malware based on knowledge search according to claim 1, wherein the detailed procedure of step 3 is:

step 3.1, inputting the original feature data set;

first, the probability distribution of the feature is calculated:

$$P(X = x_i) = p_i, \quad i = 1, 2, 3, \ldots, n$$

the empirical entropy of the features is then calculated:

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$

after completion, the conditional entropy of the feature is calculated:

$$P(X = x_i, Y = y_j) = p_{ij}, \quad i = 1, 2, 3, \ldots, n, \quad j = 1, 2, 3, \ldots, n$$

$$H(Y \mid X) = \sum_{i=1}^{n} p_i \, H(Y \mid X = x_i)$$

$$IG(D, A) = H(D) - H(D \mid A)$$

4. The knowledge search-based malware detection method according to claim 1, wherein in the step 4, a teacher network model is built based on a multi-layer perceptron MLP, and the teacher network model includes an input layer, 3 hidden layers, and an output layer; constructing a student network model based on a multi-layer perceptron MLP, wherein the student network model comprises an input layer, 1 hidden layer and an output layer; each layer of the teacher network and the student network uses a ReLU activation function, and the final classifier is a combination of a linear layer and a softmax activation function;

wherein the activation function is $\mathrm{ReLU} = \max(0, w^{T}x + b)$;

$w^{T}$ is the transpose of the weight matrix between adjacent layers, x is the input vector, and b is the bias between the layers;

wherein the classification function is

$$\mathrm{softmax}(z_k) = \frac{e^{z_k}}{\sum_{c=1}^{C} e^{z_c}}$$

$z_k$ is the output value of the k-th node, and C is the number of output nodes, i.e. the number of classes;

wherein the loss function:

$L_{CE}$ is the cross-entropy loss function, $L_{MSE}$ is the mean-square-error loss function, Y is the true label value, and $\hat{Y}$ is the predicted probability value.

5. A detection system for implementing the knowledge search-based malware detection method of any one of claims 1 to 4, comprising a feature extraction module, a feature dataset construction module, and a knowledge search module;

the feature extraction module extracts feature files in the files to be detected to obtain API, permission and derived component features of the files to be detected;

the feature data set construction module uses a decision tree algorithm, and uses information gain in the high-dimensional features to screen out more representative features;

the knowledge search module combines knowledge distillation and neural network structure search to construct a teacher network model and a student network model, trains the teacher network model and the student network model at the same time, and adaptively searches out the final student network model with the strongest learning ability in a group of parallel student network models through the neural network structure search technology, and the final student network model classifies the detection of the software files to be detected, namely malicious software or benign software.