Malicious behavior identification method, system and medium based on weak correlation integration strategy
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a malicious behavior identification method, system and medium based on a weak correlation integration strategy.
Background
With continued economic growth and great progress in communication technology, the internet has become an extremely important part of people's work and life. The internet not only greatly promotes the development of human society, but also connects the world ever more closely. In China in particular, the internet is growing faster and faster, and the related technology is maturing at the same time. The 44th Statistical Report on the Development of the Internet in China indicates that by June 2019 the number of internet users in China had reached 854 million, an increase of 25.98 million over 2018, and that the internet penetration rate had reached 81.2%, up 1.6 percentage points over 2018.
In the internet era, information security protection brooks no delay. The flood of malicious code poses a very serious threat to internet information security. In 2019, 360 Mobile Guard intercepted about 2.28 billion phishing website attacks and about 950 million malicious program attacks against users nationwide; it also intercepted about 26.09 billion harassing calls and about 9.53 billion spam text messages. According to 360 Security Brain statistics, 4.125 million computers were observed under ransomware attack in the first 11 months of 2019, and nearly 4,600 ransomware complaint cases were handled. Judged by attack trends and degree of harm, ransomware remains one of the biggest security threats faced by computers in China.
Given the sheer volume and variety of malware attacks, manual inspection alone cannot complete tens of thousands of code detection tasks, and many network security researchers are searching for more effective defense and detection methods. Big data technology provides powerful support for this goal, and a growing number of researchers apply machine learning algorithms to malicious code classification to achieve automatic detection of malicious code. In industry, security vendors have begun to use big-data-based malware detection systems that combine machine learning technology with domain expert knowledge.
(1) Feature extraction
Feature extraction has always been an important problem in malware analysis. Ravi, Manoharan et al. built a dynamic malware monitoring system that collects 4-gram statistical features of the Windows API call sequence at run time, generates classification rules with an association mining algorithm, and constructs a rule base for software classification. Abou-Assaleh T, Cercone N et al. proposed an N-gram-based malware detection model that extracts N-gram sequences of code bytes as features and uses their occurrence frequencies to compute a software similarity measure, thereby classifying malware. Yangliang proposed a malicious program detection model in which a cleaned API sequence is first fed into a Word2vec model, the resulting word vectors are arranged in order into a matrix that serves as the input of a convolutional neural network, and a deep convolutional neural network further extracts features and performs classification. Ye Y, Chen L, Hou S et al. designed an SAE deep neural network based on the Windows API calls of software: after feature learning with an autoencoder, the features are fed into a classification model with tuned parameters to detect malware. Other work represents n-gram sequences of Windows API calls with one-hot encoding, uses them as the input feature data of a convolutional neural network, and extracts hidden features of the software samples through convolution and pooling to perform classification.
(2) Analysis and detection of malware
Traditional approaches have focused primarily on static and dynamic analysis, but the rapid growth and evolution of malware forces researchers to develop new analysis and detection solutions. Machine learning is one of the innovative technologies applied in this direction. Xulin et al. studied the character features of domain names generated by the DGA algorithms used in botnets, clustered domain names whose DNS resolution failed with a clustering algorithm, screened out IP addresses associated with more than a certain number of malicious domain names according to the mapping between a single malicious domain name and multiple IP addresses, combined the screened IP addresses and the NxDomains into a matrix, and then performed bipartite graph clustering analysis again to reduce dimensionality and search for possibly infected hosts (bots). Ravi and Manoharan proposed a dynamic malicious code detection system based on frequent item sets of Windows API call sequences together with naive Bayes, support vector machines and decision trees. Ding, Chen et al. proposed a detection model based on malicious code genes, taking specific key behavior fragments contained in basic code instructions as code features. Shifu Hou, Lifei Chen et al. proposed a malicious code classification ensemble model based on k-means clustering and support vector machines. Much work has focused on building analysis frameworks, obtaining static features, and classifying malware families. Experiments show that text classification methods are effective in improving the detection precision for fuzzy samples. As for comparisons of machine learning algorithms, naive Bayes, random forest and support vector machine (SVM) have, for example, been applied to the problem of detecting malicious application programming interface (API) call sequences.
Also, by replacing byte sequences with n-grams, Kolter compared the performance of naive Bayes, decision trees and SVM in malware detection. In the area of data mining techniques for malware detection, Schultz first proposed using three different types of static features: the PE header, string sequences and byte sequences. In an alternative line of work that explores the visual features of samples, most studies assume that malware can be clustered by family or similarity. Subsequently, artificial neural networks were also used for malware detection, and some new ideas have been applied as well, such as detecting malware with image processing techniques.
The prior art has the following defects:
(1) analysis techniques in industry rely mainly on manual analysis by security experts, are strongly influenced by the experts' experience, cannot cope with large numbers of samples, and are inefficient and time-consuming;
(2) the static API function features relied on by automated identification techniques in academia are difficult to extract when malware uses obfuscation and packing techniques;
(3) current methods rely mainly on known malicious code samples, and identifying variants solely on the basis of existing samples may render the identification work inefficient or even ineffective.
Disclosure of Invention
The invention mainly aims to overcome the defects of low efficiency, difficult static feature extraction and the like in the existing malicious code identification technology, and provides a malicious behavior identification method, a malicious behavior identification system and a malicious behavior identification medium based on a weak correlation integration strategy.
In order to achieve the purpose, the invention adopts the following technical scheme:
The invention provides a malicious behavior identification method based on a weak correlation integration strategy, which comprises the following steps:
randomly extracting multiple groups of training samples based on a Bagging integration strategy, and training XGBoost on the extracted samples to obtain multiple base models;
screening the dynamic behavior features of malicious code based on XGBoost, selecting the features with the highest feature importance scores for each base model, and constructing an importance feature set;
performing a correlation test on the base models based on the weak correlation integration strategy, judging the correlation between base models by analyzing the degree of association between the importance feature sets of different base models, and then screening out correlated base models so that models with low correlation are obtained as the base models for ensemble learning;
determining the integration weight of each base model according to the accuracy of the base model;
and classifying the malicious code by adopting the Bagging integration strategy based on the integration weights.
As a preferred technical solution, the screening of the dynamic behavior features of malicious code based on XGBoost specifically comprises the following steps:
starting from a tree of depth 0, enumerating all available features for each leaf node;
for each feature, sorting the training samples belonging to the node in ascending order of that feature's value, determining the optimal split point of the feature by a linear scan, and recording the gain at the optimal split point;
selecting the feature with the maximum gain as the splitting feature, using the optimal split point of that feature as the split position, generating new left and right leaf nodes for the node, and associating a new sample set with each new node;
returning to the step of enumerating all available features for each leaf node, and continuing recursively until: the gain from the split is smaller than the set split gain threshold min_gain, the tree reaches the set maximum depth threshold max_depth, or the number of samples associated with a leaf node produced by the split is smaller than the minimum sample weight sum threshold min_child_leaf;
taking the number of times a feature is selected as a splitting feature in the model as the measure of feature importance, where more selections mean higher importance, and on that basis selecting the features with the highest feature importance scores for the base model to construct the importance feature set.
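As a preferred technical solution, for a given node, the optimal objective function before splitting is:
$$\mathrm{Obj}_{\mathrm{before}} = -\frac{1}{2}\,\frac{(G_L+G_R)^2}{H_L+H_R+\lambda} + \gamma$$
wherein $G_L$ and $G_R$ are the first-order gradient statistic sums of the left and right child node sample sets produced by splitting the current node, $H_L$ and $H_R$ are the second-order gradient statistic sums of the left and right child node sample sets, $\lambda$ is the L2 regularization coefficient, and $\gamma$ is the regularization coefficient that controls the complexity of the tree;
the optimal objective function after splitting is:
$$\mathrm{Obj}_{\mathrm{after}} = -\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda}\right] + 2\gamma$$
As a preferred technical solution, the gain after splitting is:
$$\mathrm{Gain} = \mathrm{Obj}_{\mathrm{before}} - \mathrm{Obj}_{\mathrm{after}} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$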
As a preferred technical solution, the correlation test of the base models based on the weak correlation integration strategy specifically comprises:
setting a suitable feature association degree threshold and calculating the feature association degree between each pair of base models; if the feature association degree exceeds the threshold, one of the two base models is rejected, so that correlated base models are screened out and models with low correlation are obtained as the base models for ensemble learning.
As a preferred technical solution, the determining of the integration weight according to the accuracy of the base model specifically comprises:
calculating the accuracy of each base model on the validation set, where the weight of a base model in the final integration result grows with its accuracy, that is, the better a base model performs, the larger the weight assigned to it.
As a preferred technical solution, the vector composed of the accuracies of the base models is mapped to a weight vector by a softmax function.
As a preferred technical solution, the classifying of malicious code by adopting the Bagging integration strategy specifically comprises:
sampling the malicious code training data set multiple times to obtain multiple sampling sets, each containing multiple training samples;
training, on each sampling set, a weakly correlated base model obtained from the correlation test step;
and combining the prediction results of the base models with the determined integration weights to determine the final predicted value.
In another aspect of the present invention, a malicious behavior identification system based on a weak correlation integration strategy is further provided. The system applies the above malicious behavior identification method based on a weak correlation integration strategy and includes a base model training module, an importance feature screening module, a correlation test module, and an integration and classification module:
the base model training module randomly extracts multiple groups of training samples based on a Bagging integration strategy, and trains XGBoost on the extracted samples to obtain multiple base models;
the importance feature screening module screens the dynamic behavior features of malicious code based on XGBoost, selects the features with the highest feature importance scores for each base model, and constructs an importance feature set;
the correlation test module performs a correlation test on the base models based on the weak correlation integration strategy, judges the correlation between base models by analyzing the degree of association between the importance feature sets of different base models, and then screens out correlated base models so that models with low correlation are obtained as the base models for ensemble learning;
and the integration and classification module determines the integration weights according to the accuracy of the base models, and classifies the malicious code by adopting the Bagging integration strategy based on the integration weights.
In still another aspect of the present invention, a storage medium is further provided, which stores a program; when the program is executed by a processor, it implements the above malicious behavior identification method based on a weak correlation integration strategy.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) The invention uses the XGBoost algorithm to determine the number of ensemble learning base models in malicious code identification, which eases the problem of base model selection in ensemble learning and improves the accuracy of malicious code identification.
(2) The invention adopts a weak correlation integration strategy for the ensemble learning base models, which weakens the correlation between base models that commonly arises when an integration strategy is applied to the malicious code classification task, and constructs an accuracy-guided single-model weight determination model, so that the malicious code identification task is completed efficiently and accurately.
(3) The invention adopts the Bagging strategy of ensemble learning in malicious code classification and combines it with the accuracy-guided single-model weight determination model to reduce the prediction error of the model, which effectively improves the accuracy and stability of the malicious code identification task.
Drawings
FIG. 1 is a flowchart of a malicious behavior identification method based on a weak correlation integration strategy according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the technical route for the correlation test of the base models based on the weak correlation integration strategy according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the technical route for determining the integration weights of the base models according to their accuracy according to an embodiment of the present invention;
FIG. 4 is a flowchart of classifying malicious code by adopting a Bagging integration strategy according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a malicious behavior identification system based on a weak correlation integration strategy according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a storage medium according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Examples
As shown in FIG. 1, the present embodiment provides a malicious behavior identification method based on a weak correlation integration strategy, including the following steps:
S1, randomly extracting multiple groups of training samples based on a Bagging integration strategy, and training XGBoost on the extracted samples to obtain multiple base models;
S2, screening the dynamic behavior features of malicious code based on XGBoost, selecting the features with the highest feature importance scores for each base model, and constructing an importance feature set;
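(1) Introduction of the model: the XGBoost algorithm is an improved version of the GBDT algorithm. At step $k$ its objective function is:
$$\mathrm{Obj}^{(k)} = \sum_{i=1}^{n} l\left(y^{(i)},\, \hat{y}^{(i)}_{k-1} + f_k(x^{(i)})\right) + \Omega(f_k) + \mathrm{constant}$$
To obtain the second-order expansion of the loss function $l\left(y^{(i)}, \hat{y}^{(i)}_{k-1} + f_k(x^{(i)})\right)$ at $\hat{y}^{(i)}_{k-1}$, first consider the second-order expansion of $l(y^{(i)}, x)$ at $x_0$:
$$l(y^{(i)}, x) \approx l(y^{(i)}, x_0) + l'(y^{(i)}, x_0)\,(x - x_0) + \frac{1}{2}\, l''(y^{(i)}, x_0)\,(x - x_0)^2$$
Let $x = \hat{y}^{(i)}_{k-1} + f_k(x^{(i)})$ and $x_0 = \hat{y}^{(i)}_{k-1}$, and denote $l'\left(y^{(i)}, \hat{y}^{(i)}_{k-1}\right)$ by $g_i$ and $l''\left(y^{(i)}, \hat{y}^{(i)}_{k-1}\right)$ by $h_i$. Then:
$$l\left(y^{(i)},\, \hat{y}^{(i)}_{k-1} + f_k(x^{(i)})\right) \approx l\left(y^{(i)}, \hat{y}^{(i)}_{k-1}\right) + g_i\, f_k(x^{(i)}) + \frac{1}{2}\, h_i\, f_k^2(x^{(i)})$$
Because $\hat{y}^{(i)}_{k-1}$ is already known at step $k$, $l\left(y^{(i)}, \hat{y}^{(i)}_{k-1}\right)$ is a constant and has no influence on optimizing the objective function. Substituting the above result into the objective function $\mathrm{Obj}^{(k)}$ and dropping the constants gives:
$$\mathrm{Obj}^{(k)} \approx \sum_{i=1}^{n}\left[g_i\, f_k(x^{(i)}) + \frac{1}{2}\, h_i\, f_k^2(x^{(i)})\right] + \Omega(f_k)$$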
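The second-order expansion can be checked numerically. The following is a minimal sketch, assuming the binary logistic loss (the source does not fix a particular loss function); the names `logistic_loss`, `grad_hess` and `taylor_loss` are illustrative and do not come from the source.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(y, score):
    """Binary log loss for a label y in {0, 1} and a raw (pre-sigmoid) score."""
    p = sigmoid(score)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def grad_hess(y, score):
    """g_i and h_i: first and second derivatives of the loss w.r.t. the score."""
    p = sigmoid(score)
    return p - y, p * (1.0 - p)

def taylor_loss(y, prev_score, step):
    """Second-order approximation of l(y, prev_score + step) around prev_score."""
    g, h = grad_hess(y, prev_score)
    return logistic_loss(y, prev_score) + g * step + 0.5 * h * step ** 2

# prev_score plays the role of the step k-1 prediction, step the role of f_k(x).
y, prev_score, step = 1, 0.3, 0.1
exact = logistic_loss(y, prev_score + step)
approx = taylor_loss(y, prev_score, step)
g, h = grad_hess(y, prev_score)
```

For a small step the approximation and the exact loss agree closely, which is why only $g_i$ and $h_i$ need to be carried into the tree-building stage.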
(2) Optimizing the objective function: taking the objective function of the XGBoost algorithm as an example, for any decision tree $f_k$ with $T$ leaf nodes, the decision tree is constructed from a vector $w \in \mathbb{R}^T$ composed of the values of all leaf nodes and a function $q(\cdot): \mathbb{R}^d \rightarrow \{1, 2, \ldots, T\}$ that maps a $d$-dimensional feature vector to a leaf node, and each sample lies on exactly one leaf node. The decision tree $f_k$ can therefore be defined as $f_k(x) = w_{q(x)}$. The complexity of the decision tree is defined by the regularization term
$$\Omega(f_k) = \gamma T + \frac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^2$$
which indicates that the complexity of the decision tree model is determined by the number of leaf nodes and the L2 norm of the leaf value vector $w$. Define the set $I_j = \{\, i \mid q(x^{(i)}) = j \,\}$ as the set of all training samples assigned to leaf node $j$; the sum over training samples can then be rewritten as a sum over leaf nodes, so the objective function of the XGBoost algorithm can be rewritten as:
$$\mathrm{Obj}^{(k)} \approx \sum_{j=1}^{T}\left[G_j\, w_j + \frac{1}{2}\,(H_j + \lambda)\, w_j^2\right] + \gamma T, \qquad G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i$$
The analysis shows that at step $k$, with the decision tree structure fixed, the samples on each leaf node are known, so $q(\cdot)$ and $I_j$ are also known; and because $g_i$ and $h_i$ are derivatives evaluated at step $k-1$, they are known as well, and hence $G_j$ and $H_j$ are known. Setting the first derivative of the objective function $\mathrm{Obj}^{(k)}$ with respect to $w_j$ to 0 gives the optimal value of leaf node $j$:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$
Thus, for a decision tree with a fixed structure, the optimal objective function is:
$$\mathrm{Obj}^{*} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j + \lambda} + \gamma T$$
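As a minimal sketch of the closed-form result for a fixed tree structure, the code below computes the optimal leaf values and the optimal objective from per-leaf gradient sums. The function names and the numeric values of `G`, `H`, `lam` and `gamma` are illustrative assumptions, not data from the source.

```python
def optimal_leaf_weights(G, H, lam):
    """w_j* = -G_j / (H_j + lambda) for each leaf j."""
    return [-g / (h + lam) for g, h in zip(G, H)]

def optimal_objective(G, H, lam, gamma):
    """Obj* = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T."""
    return -0.5 * sum(g * g / (h + lam) for g, h in zip(G, H)) + gamma * len(G)

def objective(G, H, w, lam, gamma):
    """Obj = sum_j [G_j w_j + 1/2 (H_j + lambda) w_j^2] + gamma * T."""
    return sum(g * wj + 0.5 * (h + lam) * wj * wj
               for g, h, wj in zip(G, H, w)) + gamma * len(G)

# Two leaves with made-up gradient statistics.
G = [-4.0, 6.0]
H = [3.0, 4.0]
lam, gamma = 1.0, 0.5

w_star = optimal_leaf_weights(G, H, lam)
obj_star = optimal_objective(G, H, lam, gamma)
obj_at_w_star = objective(G, H, w_star, lam, gamma)
obj_perturbed = objective(G, H, [w_star[0] + 0.1, w_star[1]], lam, gamma)
```

Evaluating the quadratic objective at `w_star` reproduces `obj_star`, and any perturbation of a leaf value increases the objective, which is what the zero-derivative condition expresses.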
The above derivation assumes that the decision tree structure is fixed. The number of possible decision tree structures, however, is effectively unlimited, so it is impossible in practice to enumerate them all and pick the optimal one. A greedy strategy is therefore generally used to generate each node of the decision tree. Because the XGBoost algorithm handles the overfitting problem during the generation stage of the decision tree, no separate pruning stage is required. The specific steps can be summarized as follows:
1) starting from a tree of depth 0, enumerate all available features for each leaf node;
2) for each feature, sort the training samples belonging to the node in ascending order of that feature's value, determine the optimal split point of the feature by a linear scan, and record the gain at the optimal split point;
3) select the feature with the maximum gain as the splitting feature, use the optimal split point of that feature as the split position, generate new left and right leaf nodes for the node, and associate a new sample set with each new node;
4) return to the first step and continue recursively until one of the following conditions is satisfied:
a) a split gain threshold min_gain is set, and construction of the decision tree stops when the gain obtained from the split is less than the threshold;
b) a maximum depth threshold max_depth is set, and construction of the decision tree stops when the tree reaches the maximum depth;
c) a minimum sample weight sum threshold min_child_leaf is set, and construction of the decision tree stops when the number of samples associated with a leaf node obtained by splitting is less than the threshold.
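The steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the XGBoost implementation: `best_split` and `should_stop` are hypothetical helper names, the split score is the standard gradient-statistics gain, and the toy data are made up.

```python
def leaf_score(G, H, lam):
    return G * G / (H + lam)

def best_split(X, g, h, lam=1.0, gamma=0.0):
    """Return (gain, feature_index, threshold) of the best split, or None.

    X is a list of feature rows; g and h are per-sample gradient statistics.
    """
    n, d = len(X), len(X[0])
    G_total, H_total = sum(g), sum(h)
    best = None
    for j in range(d):                                    # step 1: enumerate features
        order = sorted(range(n), key=lambda i: X[i][j])   # step 2: ascending sort
        G_left = H_left = 0.0
        for pos in range(n - 1):                          # linear scan over split points
            i, nxt = order[pos], order[pos + 1]
            G_left += g[i]
            H_left += h[i]
            if X[i][j] == X[nxt][j]:
                continue                                  # no split between equal values
            G_right, H_right = G_total - G_left, H_total - H_left
            gain = 0.5 * (leaf_score(G_left, H_left, lam)
                          + leaf_score(G_right, H_right, lam)
                          - leaf_score(G_total, H_total, lam)) - gamma
            if best is None or gain > best[0]:            # step 3: keep the maximum gain
                best = (gain, j, 0.5 * (X[i][j] + X[nxt][j]))
    return best

def should_stop(gain, depth, child_sizes, min_gain, max_depth, min_child_leaf):
    """Stopping conditions a), b) and c) of step 4."""
    return gain < min_gain or depth >= max_depth or min(child_sizes) < min_child_leaf

X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
g = [-1.0, -1.0, 1.0, 1.0]
h = [1.0, 1.0, 1.0, 1.0]
best = best_split(X, g, h)
```

In this toy data set the first feature separates the negative and positive gradients cleanly, so it is chosen with a threshold halfway between the second and third sorted values.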
Because a binary splitting strategy is adopted, a node is split into a left child node and a right child node. Apart from the node currently being processed, the Obj values of all other nodes are unchanged, so the gain calculation only needs to consider the Obj value of the current node. For this node, the optimal objective function before splitting is:
$$\mathrm{Obj}_{\mathrm{before}} = -\frac{1}{2}\,\frac{(G_L+G_R)^2}{H_L+H_R+\lambda} + \gamma$$
the optimal objective function after splitting is:
$$\mathrm{Obj}_{\mathrm{after}} = -\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda}\right] + 2\gamma$$
and the gain after splitting is therefore:
$$\mathrm{Gain} = \mathrm{Obj}_{\mathrm{before}} - \mathrm{Obj}_{\mathrm{after}} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$
wherein $G_L$ and $G_R$ are the first-order gradient statistic sums of the left and right child node sample sets produced by splitting the current node, $H_L$ and $H_R$ are the second-order gradient statistic sums of the left and right child node sample sets, $\lambda$ is the L2 regularization coefficient, and $\gamma$ is the regularization coefficient that controls the complexity of the tree;
the above formula can be used to determine the optimal splitting feature and the optimal split point of that feature.
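As a worked illustration of the gain calculation, take the made-up statistics $G_L = -4$, $H_L = 3$, $G_R = 6$, $H_R = 4$, $\lambda = 1$ and $\gamma = 0.5$ (values assumed here for easy arithmetic, not taken from the source):
$$\frac{G_L^2}{H_L+\lambda} = \frac{16}{4} = 4, \qquad \frac{G_R^2}{H_R+\lambda} = \frac{36}{5} = 7.2, \qquad \frac{(G_L+G_R)^2}{H_L+H_R+\lambda} = \frac{4}{8} = 0.5$$
$$\mathrm{Gain} = \frac{1}{2}\,(4 + 7.2 - 0.5) - 0.5 = 4.85$$
The same value follows from the two objectives separately: before the split the node scores $-\frac{1}{2}(0.5) + 0.5 = 0.25$, after the split it scores $-\frac{1}{2}(4 + 7.2) + 2(0.5) = -4.6$, and $0.25 - (-4.6) = 4.85$. A positive gain of this size would exceed any min_gain threshold below 4.85, so the split would be accepted.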
The dynamic behavior features of malicious code are screened based on XGBoost in order to construct the importance feature set of each base model. The index most commonly used by XGBoost to measure feature importance is weight, which is the number of times a feature is selected as a splitting feature in the model. The procedure above yields the optimal splitting feature and the optimal split point at each node; the number of times a given feature is selected as a splitting feature in the model is then taken as the measure of its importance, with more selections meaning higher importance. On this basis, the features with the highest feature importance scores for the base model are selected to construct the importance feature set.
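A minimal sketch of this selection is given below, assuming each trained tree has already been reduced to the list of features used at its split nodes. The helper name `importance_feature_set` and the example API-call feature names are illustrative assumptions. In the xgboost Python package the same split-count measure is reported by `Booster.get_score(importance_type="weight")`.

```python
from collections import Counter

def importance_feature_set(trees, top_n):
    """Return the top_n features ranked by how often they are used as a splitting feature.

    trees: list of trees, each given as the list of feature names at its split nodes.
    """
    counts = Counter(feature for tree in trees for feature in tree)
    # Sort by descending count, then by name so that ties are broken deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]

# Hypothetical split features of one base model made of three trees.
trees = [
    ["CreateFileW", "RegSetValueExW", "CreateFileW"],
    ["WriteProcessMemory", "CreateFileW"],
    ["RegSetValueExW", "VirtualAllocEx"],
]
top = importance_feature_set(trees, top_n=2)
```

Here `CreateFileW` is used three times and `RegSetValueExW` twice, so these two features form the importance feature set of size 2.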
S3, performing a correlation test on the base models based on the weak correlation integration strategy;
The technical route of the model correlation test is shown in FIG. 2. The correlation between base models is judged by analyzing the degree of association between the importance feature sets of different base models, so that correlated base models can be screened out and the accuracy of malicious behavior detection can be improved.
More specifically, in this step a suitable feature association degree threshold is set (for example 0.8, meaning that the features shared by the importance feature sets of the two base models account for 80% of all the features). The feature association degree between each pair of base models is calculated, and if it exceeds the threshold, one of the two base models is rejected. In this way correlated base models are screened out, and models with low correlation are obtained as the base models for ensemble learning.
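The correlation test can be sketched as follows. Two points are assumptions made for illustration, since the source does not spell them out: the feature association degree is computed as the share of common features among all features of the two sets (the Jaccard index), and when a pair exceeds the threshold the less accurate model is the one rejected. The function names are hypothetical.

```python
def feature_association(set_a, set_b):
    """Share of common features among all features of the two sets (Jaccard index)."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def select_weakly_correlated(feature_sets, accuracies, threshold=0.8):
    """Keep base models whose association with every kept model is <= threshold.

    Models are visited in order of descending accuracy, so of two strongly
    associated models the less accurate one is rejected (an assumption).
    """
    order = sorted(range(len(feature_sets)), key=lambda i: -accuracies[i])
    kept = []
    for i in order:
        if all(feature_association(feature_sets[i], feature_sets[k]) <= threshold
               for k in kept):
            kept.append(i)
    return sorted(kept)

feature_sets = [
    {"a", "b", "c", "d", "e"},
    {"a", "b", "c", "d", "e"},   # identical to model 0
    {"a", "x", "y", "z", "w"},   # shares only one feature with the others
]
accuracies = [0.90, 0.93, 0.88]
kept = select_weakly_correlated(feature_sets, accuracies, threshold=0.8)
```

Models 0 and 1 have an association degree of 1.0, so only the more accurate model 1 is kept; model 2 shares one of nine features with it and is retained as a weakly correlated base model.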
S4, determining the integration weight according to the accuracy of the base model;
The technical process of determining the integration weights of the base models is shown in FIG. 3. First, the accuracy of each base model is calculated on the validation set. The weight of a base model in the final integration result is intended to grow with its accuracy, that is, the better a base model performs, the larger the weight assigned to it. Finally, the vector formed by the accuracies of the base models is mapped to a weight vector by the softmax function.
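A minimal sketch of the softmax mapping from accuracies to integration weights follows; the accuracy values are made up for illustration.

```python
import math

def softmax(values):
    """Map a vector of scores to positive weights that sum to 1."""
    m = max(values)                          # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical validation accuracies of three base models.
accuracies = [0.95, 0.90, 0.85]
weights = softmax(accuracies)
```

The weights preserve the ranking of the accuracies and sum to 1. Because accuracies lie in [0, 1], the resulting weights stay close to uniform.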
S5, classifying the malicious code by adopting the Bagging integration strategy based on the integration weights.
(1) Introduction of ensemble learning: the idea of ensemble learning is to generate multiple learners according to some rule, combine them with some integration strategy, and output the final result by a comprehensive judgment. The learners in ensemble learning are mostly homogeneous "weak learners". The outstanding advantages of ensemble learning are that it combines the strengths of multiple base classifiers and can therefore achieve higher accuracy, higher robustness, better generalization ability and better parallelism. Assuming that each base classifier has error rate $\epsilon$, that the errors of the base classifiers are mutually independent, and that $T$ is the number of classifiers, then by the Hoeffding inequality the total error rate of the integrated classifier $H$ with respect to the true labeling function $f$ satisfies:
$$P\left(H(x) \neq f(x)\right) = \sum_{k=0}^{\lfloor T/2 \rfloor} \binom{T}{k} (1-\epsilon)^k\, \epsilon^{T-k} \leq \exp\left(-\frac{1}{2}\, T\, (1 - 2\epsilon)^2\right)$$
As the right-hand side of the formula shows, when the number $T$ of integrated classifiers is sufficiently large and $\epsilon < 0.5$, the total error rate tends to 0.
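The effect of the number of classifiers can be checked numerically. The sketch below assumes a majority vote over $T$ independent classifiers with a common error rate and compares the exact binomial error with the Hoeffding bound $\exp(-\frac{1}{2}T(1-2\epsilon)^2)$; the function names and the chosen values of `T` and `eps` are illustrative.

```python
import math

def majority_vote_error(T, eps):
    """Probability that at most floor(T/2) of T independent classifiers are correct."""
    return sum(math.comb(T, k) * (1 - eps) ** k * eps ** (T - k)
               for k in range(T // 2 + 1))

def hoeffding_bound(T, eps):
    """Upper bound exp(-T (1 - 2 eps)^2 / 2) on the majority-vote error."""
    return math.exp(-0.5 * T * (1 - 2 * eps) ** 2)

eps = 0.3
sizes = [5, 25, 101]
errors = [majority_vote_error(T, eps) for T in sizes]
bounds = [hoeffding_bound(T, eps) for T in sizes]
```

With an error rate of 0.3 per classifier, both the exact error and the bound shrink as `T` grows. The independence assumption is an idealization, which is the motivation for removing strongly correlated base models before integration.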
(2) Bagging algorithm flow: as shown in FIG. 4, the malicious code training data set is sampled multiple times to obtain multiple sampling sets, each containing multiple training samples;
a weakly correlated base model obtained from the correlation test step is trained on each sampling set;
and the prediction results of the base models are combined with the determined integration weights to determine the final predicted value.
The prediction of the integrated model yields a malicious code classification result with higher accuracy.
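The sampling and weighted combination steps can be sketched as follows. This is a simplified illustration: the training of the base models is omitted, sampling with replacement is assumed as in standard Bagging, and the class labels and weights are made up.

```python
import random

def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items with replacement to form one Bagging sampling set."""
    return [rng.choice(dataset) for _ in dataset]

def weighted_vote(predictions, weights):
    """Combine the class labels predicted by the base models using the integration weights."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)

rng = random.Random(0)
dataset = list(range(10))              # stand-in for the malicious code training samples
sample = bootstrap_sample(dataset, rng)

# Three base models predict a family label for the same sample.
label = weighted_vote(["trojan", "worm", "worm"], [0.5, 0.3, 0.1])
```

In the example the single model predicting "trojan" carries a weight of 0.5, which outweighs the combined weight of 0.4 behind "worm", so the weighted vote differs from a simple majority vote.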
In another embodiment, as shown in FIG. 5, a malicious behavior identification system based on a weak correlation integration strategy is provided, which comprises a base model training module, an importance feature screening module, a correlation test module, and an integration and classification module;
the base model training module randomly extracts multiple groups of training samples based on a Bagging integration strategy, and trains XGBoost on the extracted samples to obtain multiple base models;
the importance feature screening module screens the dynamic behavior features of malicious code based on XGBoost, selects the features with the highest feature importance scores for each base model, and constructs an importance feature set;
the correlation test module performs a correlation test on the base models based on the weak correlation integration strategy, judges the correlation between base models by analyzing the degree of association between the importance feature sets of different base models, and then screens out correlated base models so that models with low correlation are obtained as the base models for ensemble learning;
and the integration and classification module determines the integration weights according to the accuracy of the base models, and classifies the malicious code by adopting the Bagging integration strategy based on the integration weights. It should be noted that the system provided in the foregoing embodiment is illustrated only by the above division into functional modules; in practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure may be divided into different functional modules to complete all or part of the functions described above.
As shown in FIG. 6, in another embodiment of the present application, a storage medium is further provided, which stores a program; when the program is executed by a processor, the malicious behavior identification method based on a weak correlation integration strategy is implemented, specifically:
S1, randomly extracting multiple groups of training samples based on a Bagging integration strategy, and training XGBoost on the extracted samples to obtain multiple base models;
S2, screening the dynamic behavior features of malicious code based on XGBoost, selecting the features with the highest feature importance scores for each base model, and constructing an importance feature set;
S3, performing a correlation test on the base models based on the weak correlation integration strategy, judging the correlation between base models by analyzing the degree of association between the importance feature sets of different base models, and then screening out correlated base models so that models with low correlation are obtained as the base models for ensemble learning;
S4, determining the integration weight according to the accuracy of the base model;
S5, classifying the malicious code by adopting the Bagging integration strategy based on the integration weights.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.