CN111783093A - Malicious software classification and detection method based on soft dependence

Info

Publication number: CN111783093A
Application number: CN202010595193.5A
Authority: CN
Inventors: 刘哲; 张永超
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2020-06-28
Filing date: 2020-06-28
Publication date: 2020-10-16

Abstract

The invention discloses a soft-dependency-based method for classifying and detecting malware, intended to reduce the misclassification of malware during classification. The method divides the features extracted from malware into several feature subsets and trains one model on each subset; the models then classify new malware, each producing its own classification result; finally, the final classification result is obtained using the s-value (a soft-dependency value proposed on the basis of the mixed distance standard in pattern recognition). In essence, the method combines the classification results of several models through the mixed distance standard from pattern recognition and takes the combined result as the final classification result.

Description

Malicious software classification and detection method based on soft dependence

Technical Field

The invention relates to a malware classification and detection method based on soft dependency, and belongs to the field of security.

Background

With the popularization and development of computers, malware has grown as well. Because of the sheer volume of malware, conventional antivirus software struggles to identify newly emerging samples. For example, Worm.WhBoy.cw and ransomware appeared at the end of 2006 and in 2017, respectively. The former is a worm variant; the latter is a new type of virus that is combined with a worm to widen its range of infection. Such viruses are difficult to detect with conventional antivirus software. Statistical analyses indicate that malware will continue to grow explosively, so users will inevitably be attacked by malware against which conventional antivirus defenses are ineffective, and a more effective approach is needed. Conventional antivirus software identifies malware through signature-based methods, which require a large local database of malicious signatures. This approach is easily evaded by using encryption, obfuscation, and packing to make malware polymorphic, and it performs poorly against new malware variants.

For these reasons, methods based on machine learning/deep learning have attracted attention from both academia and industry. Such methods can make up for the shortcomings of traditional malware classification and identification and can achieve good identification accuracy on malware variants. One only needs to extract features from the malware, train on them with a machine learning/deep learning method to produce a trained model, and then use that model to make predictions on new malware. Some methods do not even require feature extraction and only need the malware converted into a suitable format; for example, once binary malware is converted into a grayscale image, a model can be trained with a CNN, which is clearly more convenient.

However, machine learning/deep learning based approaches bring challenges along with their advantages and opportunities. Because these methods train a model from samples, the first problem is over-fitting and under-fitting; both reduce model performance and generalization ability, making the model unsuitable for large-scale use. Malware feature extraction is another major problem. Extracting features requires fairly specialized feature-engineering knowledge, which makes it difficult for those unfamiliar with feature engineering. Moreover, it is unknown whether the dimensionality of the extracted features is too high, whether the extracted features are useful for training the model, and whether more powerful features remain undiscovered. Although some malware can be classified and identified by special deep learning methods after conversion into another form, such malware is a minority, so feature extraction cannot be avoided. The choice of machine learning/deep learning method also needs consideration; for example, some deep learning methods are not suitable for malware classification and detection tasks. In addition, although models trained by some methods achieve good accuracy, their training and prediction times are long, so accuracy is generally bought at the cost of time. On the sample feature side, the features may be processed using Principal Component Analysis (PCA). Although academia has proposed many methods for classifying or detecting malware, each has its shortcomings to a greater or lesser degree. For example, feature-based methods rely too heavily on the features of the training samples, which may cause the model to overfit.
Furthermore, if the model classifies a new type of malware that does not belong to any family seen in training, misclassification is highly likely. These problems are likely to degrade model performance. Meanwhile, no good balance has been struck among model accuracy, training time and prediction time, and time is often traded for accuracy. Current research generally pays little attention to these problems.

Disclosure of Invention

To address the problems that existing methods identify some malware variants with low accuracy and strike a poor balance among identification accuracy, training time and prediction time, a soft-dependency-based malware classification and detection method is provided. The invention can effectively detect some malware variants and finds a good balance among identification accuracy, training time and prediction time, reducing the training time and the prediction time while maintaining identification accuracy.

The invention adopts the following technical scheme for solving the technical problems:

A method for malware classification and detection based on soft dependency, comprising the following steps:

Step one: perform feature extraction on all malware samples in the malware sample training set, and divide the extracted features into n feature subsets F₀,F₁,…,F_n-1. The malware sample training set is train = {D₀, D₁, ..., D_k-1}, where D_i represents the ith malware family in the malware sample training set, 0 ≤ i ≤ k-1, and k represents the number of malware families in the malware sample training set. The jth feature subset is F_j = {Df₀, Df₁, ..., Df_k-1}, where Df_i represents the features, within that feature subset, of all malware samples in the ith malware family, and 0 ≤ j ≤ n-1;

Step two: use each of the n feature subsets F₀,F₁,…,F_n-1 from step one as training samples for machine learning, generating n models M₀,M₁,…,M_n-1;

Step three: use the n models M₀,M₁,…,M_n-1 from step two to classify the malware to be detected, obtaining n classification results P₀,P₁,…,P_n-1;

Step four: calculate the mixed distance standard value s-value of each feature subset;

Step five: using the s-values s-value₀,s-value₁,...,s-value_n-1 of the feature subsets from step four as the weights of the classification results predicted by the models in step three, obtain the classification result of the malware to be detected: P＝||s-value₀||·P₀+||s-value₁||·P₁+...+||s-value_n-1||·P_n-1.

Further, the mixed distance standard value s-value_j of the jth feature subset F_j in step four is:

s-value_j＝EC_jα+EN_jβ–EI_jγ

where EC_j＝(ec_pq)_k*k is the centroid distance standard value between the malware families in F_j, and ec_pq is the distance between the centroids of the pth and qth malware families in F_j; EN_j＝(en_ab)_k*k is the minimum distance standard value between the malware families in F_j, and en_ab is the shortest distance between the a-th and b-th malware families in F_j; EI_j＝(ei_ab)_k*2 is the inter-class distance standard value within the malware families in F_j, ei_a1 is the maximum distance between malware sample features within the a-th malware family in F_j, and ei_a2 is the sum of the distances between malware sample features within the a-th malware family in F_j; α, β and γ are the weights of EC, EN and EI, respectively.

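Further, the centroid of a malware family is

(1/U)·Σ_u α_u

where U is the number of malware samples in the malware family, the sum runs over all U samples, and α_u is the feature vector of the u-th malware sample in the malware family.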

Further, α, β and γ are obtained during the machine learning process in step two by minimizing a loss function, the loss function being

where y_v is the label of the v-th malware sample, and N is the number of malware samples in the malware sample training set.

Compared with the prior art, the invention adopting the technical scheme has the following technical effects:

The invention provides a method that identifies malware variants and seeks a balance among identification accuracy, training time and prediction time, thereby alleviating both the low identification accuracy on malware variants and the practice of trading time for identification accuracy. In addition, the amount of malware is growing exponentially, and most of it evolves from known malware; the method can identify some of these variants, and for variants it fails to recognize, a new malware identification model can be trained quickly while accuracy is maintained. The invention can thus mitigate the impact of the sharp increase in malware on computer security.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The technical scheme of the invention is explained in further detail below with reference to the accompanying drawing:

The invention provides a malware classification and detection method based on soft dependency, addressing the low identification accuracy on malware variants and the poor balance among identification accuracy, training time and prediction time. The invention aims to reduce the training time and the prediction time while maintaining identification accuracy, to find a balance among the three, and at the same time to identify some malware variants.

The classification and detection of malware is divided into 2 phases: a model training phase and an s-value final result generation phase.

Before describing these two phases, the parameters used in the invention are introduced. First, the malware sample training set is denoted train = {D₀, D₁, ..., D_k-1}, where D_i represents the ith malware family in the malware sample training set (malware of the same type is called a malware family), 0 ≤ i ≤ k-1, k represents the number of malware families in the malware sample training set, and y represents the labels of all malware samples in the malware sample training set. During feature extraction, features are extracted from all malware samples in the training set, yielding for each sample its ID (generally the hash value of the malware's name), the malware family to which it belongs, and a file storing its features. The features of all malware samples are divided into n feature subsets F₀,F₁,…,F_n-1. When the feature subsets are divided, strongly related features of the malware samples are placed in the same feature subset; for example, the size, compressed size and compression ratio of a malware sample can be placed in the same feature subset. Typically, a chi-square test computed over the values of a feature column across all samples is used to measure the similarity between features, and the first k features after sorting in ascending order are regarded as the more strongly related features. Thus, the jth feature subset (j is the index of the feature subset, 0 ≤ j ≤ n-1) may be denoted F_j = {Df₀, Df₁, ..., Df_k-1}, where Df_i represents the features, within that feature subset, of all malware samples in the ith malware family. The feature vector of a malware sample s of a given malware family in a given feature subset is denoted sample_i = {f₀, f₁, ..., f_m-1}, where f_x is a feature value of sample s, m is the number of feature values, and 0 ≤ x ≤ m-1.
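As a rough illustration of the notation above, the following Python sketch groups per-sample feature vectors into feature subsets F_j, each holding the per-family features Df_i. The sample values, the family names and the `column_groups` assignment are invented for illustration; the text does not prescribe a data layout, and the chi-square-based grouping is not implemented here.

```python
# Hypothetical layout: each sample is (family_label, feature_vector).
# All values and names below are made up for illustration.
samples = [
    ("family_0", [120.0, 60.0, 0.50, 3.0, 7.0]),
    ("family_0", [100.0, 55.0, 0.55, 2.0, 9.0]),
    ("family_1", [900.0, 300.0, 0.33, 8.0, 1.0]),
]

# Assumed grouping of feature columns into n = 2 subsets, e.g. columns
# 0-2 could be size, compressed size and compression ratio.
column_groups = [[0, 1, 2], [3, 4]]


def split_into_subsets(samples, column_groups):
    """Return one dict per feature subset F_j, mapping family -> vectors (Df_i)."""
    subsets = []
    for columns in column_groups:
        by_family = {}
        for family, vector in samples:
            # Keep only the columns that belong to this feature subset.
            by_family.setdefault(family, []).append([vector[c] for c in columns])
        subsets.append(by_family)
    return subsets


feature_subsets = split_into_subsets(samples, column_groups)
```

Each entry of `feature_subsets` can then be handed to its own sub-model, which is what allows the later training and prediction steps to run independently per subset.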

The model training phase and the s-value final result generation phase are described below.

1) Model training phase

In this phase, the feature subsets are extracted from the malware samples, a machine learning/deep learning algorithm (such as XGBoost) is then used to train a corresponding sub-model on each feature subset, and finally the sub-models generate prediction results, denoted P₀,P₁,…,P_n-1.

2) s-value final result generation phase

This phase uses the s-value to generate the final malware classification result from the prediction results generated in 1). The s-value is proposed on the basis of the mixed distance standard in pattern recognition, and its formula is as follows:

s-value＝ECα+ENβ–EIγ

wherein s-value is a mixed distance standard value, EC is a centroid distance standard value, EN is a minimum distance standard value between malware families, EI is an inter-class distance standard value within a malware family, and α, β, and γ are weights of the 3 standard values, respectively, for adjusting the importance of each standard.

Step one: solve for EC.

EC_j represents the distance standard value between the centroids of the malware families in feature subset F_j. The larger this value, the greater the distance between the centroids of any two malware families in F_j, and the higher the discrimination between different families. To obtain EC, the centroid of each malware family in F_j is first calculated from the feature vectors Df_i of all malware samples in that family, using the following formula:

centroid = (1/U)·Σ_u α_u

where U is the number of malware samples in a malware family, the sum runs over all U samples, and α_u is the feature vector of the u-th malware sample in the malware family.

Finally, the centroid of each malware family in feature subset F_j is calculated with the above formula, and the distance between the centroids of every pair of malware families is then computed, yielding a k x k matrix denoted EC. Its element ec_pq is the distance between the centroids of the pth and qth malware families in F_j, and the main diagonal elements of the matrix are all 0.
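The EC computation can be sketched in a few lines of Python. The sketch assumes Euclidean distance, which the text does not specify, and uses invented toy vectors; `centroid` and `centroid_distance_matrix` are illustrative names only.

```python
import math


def centroid(vectors):
    """Mean of the feature vectors of one family: (1/U) * sum of alpha_u."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def centroid_distance_matrix(families):
    """k x k matrix EC; entry [p][q] is the distance between centroids p and q."""
    centers = [centroid(vectors) for vectors in families]
    # Euclidean distance is an assumption; any metric could be substituted.
    return [[math.dist(cp, cq) for cq in centers] for cp in centers]


# Toy feature subset F_j with k = 2 families (made-up values).
families = [
    [[0.0, 0.0], [2.0, 0.0]],  # centroid (1, 0)
    [[3.0, 4.0], [5.0, 4.0]],  # centroid (4, 4)
]
ec = centroid_distance_matrix(families)
```

With these toy values the two centroids are 5 units apart, so `ec` has 5.0 off the diagonal and 0.0 on it.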

Step two: solve for EN.

EN_j represents the shortest distances between malware families in feature subset F_j. That is, F_j is traversed to find the distance between every feature vector in the a-th malware family and every feature vector in the b-th malware family, and the minimum of these distances is taken as the shortest distance between the a-th and b-th families. This yields a k x k matrix denoted EN, whose element en_ab is the shortest distance between the a-th and b-th malware families in F_j; the diagonal elements of the matrix are 0.
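A brute-force sketch of the EN traversal described above, again assuming Euclidean distance (not stated in the text) and invented toy data:

```python
import math


def min_distance_matrix(families):
    """k x k matrix EN; entry [a][b] is the shortest distance between any
    vector of family a and any vector of family b (0 on the diagonal)."""
    k = len(families)
    en = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(a + 1, k):
            # Compare every vector of family a with every vector of family b.
            shortest = min(
                math.dist(x, y) for x in families[a] for y in families[b]
            )
            en[a][b] = en[b][a] = shortest
    return en


# Toy feature subset with k = 2 families (made-up values).
families = [
    [[0.0, 0.0], [1.0, 0.0]],
    [[4.0, 4.0], [9.0, 9.0]],
]
en = min_distance_matrix(families)
```

The pairwise traversal costs time proportional to the product of the two family sizes, so for large families a spatial index would likely be preferable; that optimization is outside what the text describes.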

Step three: solve for EI.

EI_j represents the inter-class distance standard value within the malware families of feature subset F_j. It has two values per family: the maximum distance between the feature vectors of the malware samples within a malware family in F_j, and the sum of the distances between the feature vectors of all malware samples within that family.

This yields a k x 2 matrix denoted EI_j＝(ei_ab)_k*2, where ei_a1 is the maximum distance between malware sample features within the a-th malware family in F_j, and ei_a2 is the sum of the distances between malware sample features within the a-th malware family in F_j.
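The two columns of EI can be sketched as follows. Two assumptions go beyond the text: Euclidean distance is used, and each unordered pair of samples is counted once in the sum. The data are invented.

```python
import math
from itertools import combinations


def intra_family_matrix(families):
    """k x 2 matrix EI; row a holds (max pairwise distance, sum of pairwise
    distances) over the samples of family a."""
    rows = []
    for vectors in families:
        # Each unordered pair is counted once (an assumption).
        distances = [math.dist(x, y) for x, y in combinations(vectors, 2)]
        rows.append([max(distances, default=0.0), float(sum(distances))])
    return rows


# Toy feature subset with k = 2 families (made-up values).
families = [
    [[0.0, 0.0], [3.0, 4.0], [0.0, 4.0]],  # pairwise distances 5, 4 and 3
    [[1.0, 1.0]],                          # single sample: no pairs
]
ei = intra_family_matrix(families)
```

A family with a single sample has no pairs, so the sketch returns 0.0 for both columns in that case.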

Step four: solve for the s-value.

Having obtained EC, EN and EI through the three steps above, the values of α, β and γ are still needed in order to solve for the s-value of each feature subset. These three values are found by optimizing the loss function while the corresponding sub-model is trained with machine learning/deep learning. Once they are computed, the s-value of each feature subset can be obtained, and the final malware classification result can then be derived from the prediction results generated in 1).
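The text does not say how the k x k matrices EC and EN and the k x 2 matrix EI are combined into one quantity. One reading that makes the dimensions agree, offered here purely as an assumption, treats α and β as length-k vectors and γ as a length-2 vector, so that each product is a matrix-vector product and the s-value is a length-k vector whose norm serves as the weight. The sketch below follows that reading with placeholder weights and invented matrices; it is not the patent's stated procedure.

```python
import math


def mat_vec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def s_value(ec, en, ei, alpha, beta, gamma):
    """s-value = EC*alpha + EN*beta - EI*gamma, read as matrix-vector
    products (an assumed interpretation, see the lead-in)."""
    a, b, c = mat_vec(ec, alpha), mat_vec(en, beta), mat_vec(ei, gamma)
    return [x + y - z for x, y, z in zip(a, b, c)]


def norm(vector):
    """Euclidean norm, standing in for ||s-value||."""
    return math.sqrt(sum(v * v for v in vector))


# Made-up matrices for k = 2 families.
ec = [[0.0, 5.0], [5.0, 0.0]]
en = [[0.0, 3.0], [3.0, 0.0]]
ei = [[2.0, 4.0], [1.0, 2.0]]
# Placeholder weights; in the text they come from minimizing the loss.
alpha, beta, gamma = [1.0, 1.0], [1.0, 1.0], [0.5, 0.5]

s = s_value(ec, en, ei, alpha, beta, gamma)
weight = norm(s)
```

Under this reading, larger inter-family distances (EC, EN) raise the weight of a feature subset and larger intra-family spread (EI) lowers it, which matches the signs in the formula.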

The methods for solving α, β and γ and for calculating the final malware classification result are described in detail below.

To obtain the values of α, β and γ, the formula s-value＝ECα+ENβ–EIγ can be written as follows:

When each feature subset is used as input to train its corresponding sub-model, the sub-model outputs a prediction result for each malware sample in its test set; the prediction results for a given malware sample are denoted p₀, p₁, ..., p_n-1 (all are 1 x n vectors). The EC, EN and EI obtained for the feature subsets are denoted [EC₀, EN₀, EI₀], [EC₁, EN₁, EI₁], ..., [EC_n-1, EN_n-1, EI_n-1], which give a group of s-values denoted s-value₀, s-value₁, ..., s-value_n-1. To find the values of α, β and γ, the following equation is used as the loss function:

where y_v is the label of the v-th malware sample, represented as a 1 x N vector, and N is the number of training samples in the training set.

For the model to converge, the loss function only needs to be minimized; minimizing it yields the values of α, β and γ.

In addition, the sub-models trained on the respective feature subsets are now available, and each sub-model produces its own prediction result on the test set, denoted P₀, P₁, ..., P_n-1. The s-value of each feature subset is then calculated and denoted s-value₀, s-value₁, ..., s-value_n-1. The final classification result is calculated by the following formula:

P＝||s-value₀||·P₀+||s-value₁||·P₁+...+||s-value_n-1||·P_n-1。
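The weighted combination in the formula above can be sketched as follows. The sketch assumes that each P_j is a vector of per-family scores and that the norms ||s-value_j|| have already been reduced to scalars; the numbers are invented.

```python
def combine_predictions(weights, predictions):
    """P = sum over j of w_j * P_j, where w_j stands for ||s-value_j||."""
    classes = len(predictions[0])
    combined = [0.0] * classes
    for w, p in zip(weights, predictions):
        for c in range(classes):
            combined[c] += w * p[c]
    return combined


# Three sub-models scoring three families; values are made up.
predictions = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
]
weights = [2.0, 1.0, 1.0]  # stand-ins for ||s-value_0||, ||s-value_1||, ...

scores = combine_predictions(weights, predictions)
# The family with the highest combined score is taken as the final result.
predicted_family = max(range(len(scores)), key=scores.__getitem__)
```

Here the first family receives the highest combined score, so `predicted_family` is 0. Taking the arg-max of the combined scores is one natural way to read the final classification result; the text itself only gives the weighted sum.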

As shown in FIG. 1, the specific implementation process of the invention is as follows:

Step one: perform feature extraction on all malware samples in the malware sample training set, and divide the extracted features into n feature subsets F₀,F₁,…,F_n-1.

Step two: train a corresponding model on each of the feature subsets F₀,F₁,…,F_n-1, denoted M₀,M₁,…,M_n-1. The models are trained using a machine learning/deep learning algorithm, such as the XGBoost algorithm.

Step three: use the models M₀,M₁,…,M_n-1 to classify the malware to be detected, obtaining the corresponding classification results P₀,P₁,…,P_n-1.

Step four: calculate the mixed distance standard value s-value of each feature subset.

Step five: using the s-values s-value₀, s-value₁, ..., s-value_n-1 of the feature subsets from step four as the weights of the classification results predicted by the models in step three, obtain the classification result of the malware to be detected.

The method reduces model training time by partitioning the features into several feature spaces, which lowers the dimensionality of each feature space, and finally integrates the classification results of the sub-models into a final malware classification result through the s-value. Model training can be performed in parallel across the feature subsets, and prediction on new malware can likewise be performed in parallel, so both the training time and the prediction time are reduced.
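The parallel structure can be sketched with the Python standard library. To keep the example self-contained, a nearest-centroid classifier stands in for the XGBoost sub-models named in the text; the stand-in trainer, the toy data and the function names are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def train_nearest_centroid(by_family):
    """Stand-in trainer (not XGBoost): store one centroid per family."""
    return {
        family: [sum(col) / len(vectors) for col in zip(*vectors)]
        for family, vectors in by_family.items()
    }


def predict(model, vector):
    """Return the family whose centroid is closest to the vector."""
    return min(model, key=lambda family: math.dist(model[family], vector))


# Two feature subsets F_0 and F_1, each mapping family -> vectors (toy data).
feature_subsets = [
    {"A": [[0.0, 0.0], [1.0, 1.0]], "B": [[9.0, 9.0], [8.0, 8.0]]},
    {"A": [[5.0], [6.0]], "B": [[1.0], [2.0]]},
]

# Train one sub-model per feature subset; the tasks are independent.
with ThreadPoolExecutor() as pool:
    models = list(pool.map(train_nearest_centroid, feature_subsets))

# A sample to classify, already split by feature subset.
sample_parts = [[0.5, 0.2], [5.4]]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(predict, models, sample_parts))
```

Threads are used here only to show that the per-subset tasks do not depend on one another; for CPU-bound training in CPython a process pool, or the learner's own multithreading, would be the more likely choice.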

In conclusion, the invention mainly addresses the low identification accuracy of deep learning methods on malware variants and the poor balance among model accuracy, training time and prediction time, and provides an improved method. The soft-dependency value (s-value) is based on the mixed distance standard in pattern recognition; using the s-value, an evaluation value is obtained for each feature subset, and this evaluation value serves as the weight of the corresponding prediction to produce the final malware prediction result. In this way, the accuracy of malware classification is improved, a balance is found among classification accuracy, model training time and prediction time, and the training and prediction times are shortened while classification accuracy is maintained.

The above is only an embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any modification or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention; therefore, the scope of protection shall be determined by the claims.

Claims

1. A malware classification and detection method based on soft dependency is characterized by comprising the following steps:

Step five: using the s-values s-value₀,s-value₁,…,s-value_n-1 of the feature subsets from step four as the weights of the classification results predicted by the models in step three, obtain the classification result of the malware to be detected: P＝||s-value₀||·P₀+||s-value₁||·P₁+...+||s-value_n-1||·P_n-1.

2. The soft dependency-based malware classification and detection method as claimed in claim 1, wherein the mixed distance standard value s-value_j of the jth feature subset F_j in step four is:

s-value_j＝EC_jα+EN_jβ–EI_jγ

3. The soft dependency-based malware classification and detection method as claimed in claim 2, wherein the centroid of a malware family is

(1/U)·Σ_u α_u

where U is the number of malware samples in the malware family and α_u is the feature vector of the u-th malware sample in the malware family.

4. The soft dependency-based malware classification and detection method as claimed in claim 2, wherein α, β and γ are obtained by minimizing a loss function during the machine learning process in step two, the loss function being