CN113221112B - Malicious behavior identification method, system and medium based on weak correlation integration strategy - Google Patents

Publication number
CN113221112B
Authority
CN
China
Prior art keywords
correlation
feature
base
model
integration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110590847.XA
Other languages
Chinese (zh)
Other versions
CN113221112A (en)
Inventor
李树栋
厉源
吴晓波
韩伟红
方滨兴
田志宏
顾钊铨
殷丽华
杨航锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning

Abstract

The invention discloses a malicious behavior identification method, system and medium based on a weak correlation integration strategy. The method comprises: constructing base models from a sample set; screening the dynamic behavior features of malicious code based on XGBoost; performing a correlation test on the base models based on the weak correlation integration strategy; determining the integration weight of each base model according to its accuracy; and classifying malicious code based on the Bagging integration strategy. In malicious code identification, the invention first uses the XGBoost algorithm to determine the number of ensemble learning base models, which eases the problem of selecting base models in ensemble learning and also improves the accuracy of malicious code identification. In addition, the invention adopts a weak correlation integration strategy for the ensemble learning base models, which weakens the correlation between base models that commonly arises when an integration strategy is used for malicious code classification, and constructs an accuracy-oriented model for determining single-model weights, so that the malicious code identification task is completed efficiently and accurately.

Figure 202110590847

Description

Malicious behavior identification method, system and medium based on weak correlation integration strategy
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a malicious behavior identification method, a malicious behavior identification system and a malicious behavior identification medium based on a weak correlation integration strategy.
Background
With sustained economic growth and great progress in communication technology, the internet has become an extremely important part of people's work and life. It has greatly promoted the development of human society and drawn the world ever closer together. In China in particular, the internet is growing faster and faster, and the related technology is maturing. The 44th Statistical Report on Internet Development in China indicates that, as of June 2019, the number of internet users in China had reached 854 million, an increase of 25.98 million over the end of 2018, and that the internet penetration rate had reached 81.2%, up 1.6 percentage points from the end of 2018.
In the internet era, information security protection cannot be delayed. The flood of malicious code poses a very serious threat to internet information security. In 2019, 360 Mobile Guard intercepted about 2.28 billion phishing-website attacks and about 950 million malicious-program attacks against users nationwide, along with about 26.09 billion harassing calls and about 9.53 billion spam text messages. According to statistics from 360 Security Brain, 4.125 million computers attacked by ransomware were detected in the first 11 months of 2019, and nearly 4,600 ransomware complaint cases were handled. Judging by attack trends and degree of harm, ransomware attacks remain one of the biggest security threats facing computers in China.
Faced with attacks by malware in such volume and variety, manual inspection alone cannot complete tens of thousands of code detection tasks. Many network security researchers are therefore searching for more effective defense methods and detection means. Big data technology provides powerful support for this, and a growing number of researchers apply machine learning algorithms to malicious code classification in order to detect malicious code automatically. In industry, security vendors have begun to use big-data-based malware detection systems that combine machine learning with domain expert knowledge.
(1) Feature extraction
Feature extraction has always been an important problem in malware analysis. Ravi, Manoharan et al. built a dynamic malware monitoring system that obtains 4-gram statistical features of the Windows API call sequence at run time, generates classification rules with an association mining algorithm, and constructs a rule base for software classification. Abou-Assaleh T, Cercone N et al. proposed an N-gram-based malware detection model that takes N-gram sequences of code bytes as features and uses their frequency of occurrence to compute a software similarity measure for classifying malware. Yangliang et al. proposed a malicious program detection model in which the cleaned API sequence is first fed into a Word2vec model, the resulting word vectors are arranged in order into a matrix that serves as the input of a convolutional neural network, and a deep convolutional neural network then extracts further features and performs classification. Ye Y, Chen L, Hou S et al. designed an SAE deep neural network based on the Windows API calls of software; after feature learning with an autoencoder, the features are fed into a classification model with tuned parameters to detect malware. Another approach represents the n-gram sequence of Windows API calls with one-hot encoding, uses it as the input feature data of a convolutional neural network, and extracts hidden features of the software samples through convolution and pooling to perform classification.
(2) Malware analysis and detection
Traditional approaches have focused mainly on static and dynamic analysis, but the rapid growth and evolution of malware forces researchers to develop new analysis and detection solutions, and machine learning is one of the innovative technologies applied in this direction. Xulin et al. analyzed the character features of domain names generated by the DGA algorithms used in botnets and clustered the domain names whose DNS resolution failed. They then screened out IP addresses associated with more than a certain number of malicious domain names, based on the mapping between a single malicious domain name and multiple IP addresses, combined the screened IP addresses and the NxDomains into a matrix, and performed bipartite graph clustering again to reduce dimensionality and find possibly infected hosts (bots). Ravi and Manohara proposed a dynamic malicious code detection system based on frequent itemsets of Windows API call sequences together with naive Bayes, support vector machines and decision trees. Ding, Chen et al. proposed a detection model based on malicious code genes, taking specific key behavior fragments contained in basic code instructions as code features. ShifuHou, Life Chen et al. proposed a malicious code classification ensemble model based on k-means clustering and support vector machines. Much work has focused on building an analysis framework, obtaining static features, and classifying malware families. Experiments show that text classification methods are effective at improving detection accuracy on obfuscated samples. Machine learning algorithms have also been compared with one another; for example, naive Bayes, random forest and support vector machines (SVM) have been applied to the problem of detecting malicious application programming interface (API) call sequences.
Also, by replacing byte sequences with n-grams, Kolter compared the performance of naive Bayes, decision trees and SVM in malware detection. In data mining and clustering for malware detection, Schultz first proposed using three different types of static features: the PE header, string sequences and byte sequences. In approaches that explore the visual features of samples, most studies hold that malware can be clustered by family or similarity. Artificial neural networks were subsequently applied to malware detection as well, along with other new ideas such as detecting malware with image processing techniques.
The prior art has the following defects:
(1) analysis in industry relies mainly on manual analysis by security experts, which depends heavily on the experts' experience, cannot cope with large numbers of samples, and is inefficient and time-consuming;
(2) the static API function features relied on by academic automated identification techniques are difficult to extract when malware uses obfuscation and packing techniques;
(3) current methods rely primarily on known malicious code samples, so identifying variants based solely on existing samples can be inefficient or even ineffective.
Disclosure of Invention
The main purpose of the invention is to overcome the defects of existing malicious code identification technology, such as low efficiency and difficult static feature extraction, by providing a malicious behavior identification method, system and medium based on a weak correlation integration strategy.
To achieve this purpose, the invention adopts the following technical solution:
The invention provides a malicious behavior identification method based on a weak correlation integration strategy, which comprises the following steps:
randomly extracting a plurality of groups of training samples based on a Bagging integration strategy, and training XGBoost on the extracted samples to obtain a plurality of base models;
screening the dynamic behavior features of malicious code based on XGBoost, screening out the several features with the highest feature importance scores for each base model, and constructing an importance feature set;
performing a correlation test on the base models based on the weak correlation integration strategy, judging the correlation between base models by analyzing the degree of association between the importance feature sets of different base models, and then screening out and eliminating correlated base models to obtain models with low correlation as the base models for ensemble learning;
determining the integration weight of each base model according to its accuracy;
and classifying malicious code using the Bagging integration strategy based on the integration weights.
As a preferred technical solution, screening the dynamic behavior features of malicious code based on XGBoost specifically comprises the following steps:
starting from a tree of depth 0, exhaustively enumerating all available features for each leaf node;
for each feature, sorting the feature values of the training samples belonging to the node in ascending order, determining the optimal split point of the feature by linear scan, and recording the gain of that optimal split point;
selecting the feature with the maximum gain as the splitting feature, using its optimal split point as the split position, generating new left and right leaf nodes for the node, and associating a new sample set with each new node;
returning to the step of exhaustively enumerating all available features for each leaf node, and continuing the recursion until: the gain after splitting is smaller than the set split gain threshold min_gain, the tree reaches the set maximum depth threshold max_depth, or the number of samples associated with a leaf node after splitting is smaller than the minimum sample weight sum threshold min_child_leaf;
taking the number of times a feature is selected as a splitting feature in the model as the index of feature importance (the more times, the more important the feature), thereby screening out the several features with the highest feature importance scores for each base model and constructing the importance feature set.
As a preferred technical solution, for a given node, the optimal objective function before splitting is:
$$Obj_{before} = -\frac{1}{2}\,\frac{(G_L + G_R)^2}{H_L + H_R + \lambda} + \gamma$$
where $G_L$ and $G_R$ are the sums of first-order gradient statistics over the sample sets of the left and right child nodes split from the current node, $H_L$ and $H_R$ are the sums of second-order gradient statistics over the sample sets of the left and right child nodes, $\lambda$ is the L2 regularization coefficient, and $\gamma$ is the regularization coefficient controlling the complexity of the tree;
the optimal objective function after splitting is:
$$Obj_{after} = -\frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}\right] + 2\gamma$$
As a preferred technical solution, the gain after splitting is:
$$Gain = Obj_{before} - Obj_{after} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$
As a preferred technical solution, the correlation test of the base models based on the weak correlation integration strategy specifically comprises:
setting a suitable feature association degree threshold, calculating the feature association degree between each pair of base models, and, if the value exceeds the threshold, rejecting one of the two base models, thereby screening out and eliminating correlation between the base models and obtaining models with low correlation as the base models for ensemble learning.
As a preferred technical solution, determining the integration weight according to the accuracy of the base model specifically comprises:
calculating the accuracy of each base model on the validation set, the weight of a base model in the final integration result being directly proportional to its accuracy, i.e. the better a base model performs, the larger the weight assigned to it.
As a preferred technical solution, the vector composed of the accuracies of the base models is mapped to a weight vector by a softmax function.
As a preferred technical solution, classifying malicious code using the Bagging integration strategy specifically comprises:
sampling a plurality of times from the malicious code training data set to obtain a plurality of sampling sets, each containing a plurality of training samples;
training, on each sampling set, a weakly correlated base model obtained by the correlation test step;
and combining the prediction results of the base models by weighting them according to the determined integration weights to determine the final predicted value.
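The sampling and weighted-combination steps can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes sampling with replacement (the usual Bagging bootstrap), a weighted majority vote over class labels, and hypothetical helper names `bootstrap_sample` and `weighted_vote`; the labels in the usage note are made up.

```python
import random
from collections import defaultdict


def bootstrap_sample(dataset, size, rng):
    """Draw `size` items from `dataset` with replacement (assumed Bagging sampling)."""
    return [rng.choice(dataset) for _ in range(size)]


def weighted_vote(predictions, weights):
    """Combine one predicted label per base model using the integration weights.

    The label whose supporting models have the largest total weight wins.
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    print(len(bootstrap_sample(list(range(10)), 10, rng)))
    print(weighted_vote(["trojan", "worm", "worm"], [0.5, 0.3, 0.3]))
```

In the usage above, two lower-weight models that agree on one label outvote a single higher-weight model, which is the intended effect of combining predictions by weight rather than by a simple count.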
In another aspect, the invention further provides a malicious behavior identification system based on a weak correlation integration strategy, which applies the above malicious behavior identification method and comprises a base model training module, an importance feature screening module, a correlation test module, and an integration and classification module:
the base model training module randomly extracts a plurality of groups of training samples based on a Bagging integration strategy and trains XGBoost on the extracted samples to obtain a plurality of base models;
the importance feature screening module screens the dynamic behavior features of malicious code based on XGBoost, screens out the several features with the highest feature importance scores for each base model, and constructs an importance feature set;
the correlation test module performs a correlation test on the base models based on the weak correlation integration strategy, judges the correlation between base models by analyzing the degree of association between the importance feature sets of different base models, and screens out and eliminates correlated base models to obtain models with low correlation as the base models for ensemble learning;
the integration and classification module determines the integration weights according to the accuracy of the base models and classifies malicious code using the Bagging integration strategy based on the integration weights.
In still another aspect, the invention further provides a storage medium storing a program which, when executed by a processor, implements the above malicious behavior identification method based on the weak correlation integration strategy.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) The invention uses the XGBoost algorithm in malicious code identification to determine the number of ensemble learning base models, which eases the problem of selecting base models in ensemble learning and improves the accuracy of malicious code identification.
(2) The invention adopts a weak correlation integration strategy for the ensemble learning base models, which weakens the correlation between base models that commonly arises when an integration strategy is used for malicious code classification, and constructs an accuracy-oriented model for determining single-model weights, so that the malicious code identification task is completed efficiently and accurately.
(3) The invention applies the Bagging strategy of ensemble learning to the classification of malicious code and combines it with the accuracy-oriented single-model weight determination model to reduce the prediction error of the model, effectively improving the accuracy and stability of malicious code identification.
Drawings
FIG. 1 is a flowchart of a malicious behavior identification method based on a weak correlation integration strategy according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the technical route for the correlation test of the base models based on the weak correlation integration strategy according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the technical route for determining the integration weights of the base models according to their accuracy according to an embodiment of the present invention;
FIG. 4 is a flowchart of classifying malicious code using the Bagging integration strategy according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a malicious behavior identification system based on a weak correlation integration strategy according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a storage medium according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Examples
As shown in FIG. 1, the present embodiment provides a malicious behavior identification method based on a weak correlation integration strategy, including the following steps:
S1, randomly extracting a plurality of groups of training samples based on a Bagging integration strategy, and training XGBoost on the extracted samples to obtain a plurality of base models;
S2, screening the dynamic behavior features of malicious code based on XGBoost, screening out the several features with the highest feature importance scores for each base model, and constructing an importance feature set;
(1) Introduction of the model: the XGBoost algorithm is an improved version of the GBDT algorithm. At step $k$ its objective function is:
$$Obj^{(k)} = \sum_{i=1}^{n} l\left(y^{(i)},\ \hat{y}_{k-1}^{(i)} + f_k(x^{(i)})\right) + \Omega(f_k)$$
To simplify it, take the second-order expansion of the loss function $l\left(y^{(i)}, \hat{y}_{k-1}^{(i)} + f_k(x^{(i)})\right)$ at $\hat{y}_{k-1}^{(i)}$. First, expanding $l(y^{(i)}, x)$ to second order at $\hat{y}_{k-1}^{(i)}$ gives:
$$l(y^{(i)}, x) \approx l\left(y^{(i)}, \hat{y}_{k-1}^{(i)}\right) + l'\left(y^{(i)}, \hat{y}_{k-1}^{(i)}\right)\left(x - \hat{y}_{k-1}^{(i)}\right) + \frac{1}{2}\, l''\left(y^{(i)}, \hat{y}_{k-1}^{(i)}\right)\left(x - \hat{y}_{k-1}^{(i)}\right)^2$$
Let $x = \hat{y}_{k-1}^{(i)} + f_k(x^{(i)})$, and denote $l'\left(y^{(i)}, \hat{y}_{k-1}^{(i)}\right)$ as $g_i$ and $l''\left(y^{(i)}, \hat{y}_{k-1}^{(i)}\right)$ as $h_i$; then:
$$l\left(y^{(i)},\ \hat{y}_{k-1}^{(i)} + f_k(x^{(i)})\right) \approx l\left(y^{(i)}, \hat{y}_{k-1}^{(i)}\right) + g_i f_k(x^{(i)}) + \frac{1}{2} h_i f_k^2(x^{(i)})$$
Because $\hat{y}_{k-1}^{(i)}$ is already known at step $k$, the term $l\left(y^{(i)}, \hat{y}_{k-1}^{(i)}\right)$ is a constant and has no effect on optimizing the objective function. Substituting the above into the objective function $Obj^{(k)}$ gives:
$$Obj^{(k)} \approx \sum_{i=1}^{n}\left[g_i f_k(x^{(i)}) + \frac{1}{2} h_i f_k^2(x^{(i)})\right] + \Omega(f_k)$$
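A small numerical check can make the second-order expansion concrete. The sketch below assumes the logistic loss on a raw score, which the text does not specify; the function names are illustrative. For this loss the first and second derivatives with respect to the score are p - y and p(1 - p), where p is the sigmoid of the score.

```python
import math


def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))


def logistic_loss(y, score):
    """Assumed loss l(y, score) for a binary label y in {0, 1}."""
    p = sigmoid(score)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def grad_hess(y, prev_score):
    """First and second derivatives of the loss at the previous prediction (g_i, h_i)."""
    p = sigmoid(prev_score)
    return p - y, p * (1.0 - p)


def second_order_approx(y, prev_score, step):
    """l(y, prev) + g * step + 0.5 * h * step^2, where step plays the role of f_k(x)."""
    g, h = grad_hess(y, prev_score)
    return logistic_loss(y, prev_score) + g * step + 0.5 * h * step * step


if __name__ == "__main__":
    exact = logistic_loss(1, 0.3 + 0.05)
    approx = second_order_approx(1, 0.3, 0.05)
    print(exact, approx)
```

For a small step the two printed values agree closely, which is why the constant term can be dropped and only the g and h terms need to be optimized at each boosting step.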
(2) Optimizing the objective function: taking the objective function of the XGBoost algorithm as an example, for any decision tree $f_k$ with $T$ leaf nodes, the tree is composed of a vector $w \in \mathbb{R}^{T}$ of the values corresponding to all leaf nodes and a function $q(\cdot)$ that maps feature vectors to leaf nodes:
$$q: \mathbb{R}^{d} \rightarrow \{1, 2, \ldots, T\}$$
Each sample falls on a unique leaf node, so the decision tree $f_k$ can be defined as $f_k(x) = w_{q(x)}$. The complexity of the decision tree can be defined by the regularization term
$$\Omega(f_k) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
This regularization term indicates that the complexity of the decision tree model is determined by the number of leaf nodes and the L2 norm of the leaf value vector $w$. Define the set $I_j = \{\, i \mid q(x^{(i)}) = j \,\}$ as the set of all training samples assigned to leaf node $j$; the sum over training samples can then be rewritten as a sum over leaf nodes, so the objective function of the XGBoost algorithm can be rewritten as:
$$Obj^{(k)} \approx \sum_{j=1}^{T}\left[\left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2\right] + \gamma T$$
Let
$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$
Then:
$$Obj^{(k)} \approx \sum_{j=1}^{T}\left[G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^2\right] + \gamma T$$
At step $k$, with the decision tree structure fixed, the samples on each leaf node are known, so $q(\cdot)$ and $I_j$ are also known; and because $g_i$ and $h_i$ are derivatives evaluated at step $k-1$, they are known as well, and therefore $G_j$ and $H_j$ are known. Setting the first derivative of the objective function $Obj^{(k)}$ with respect to $w_j$ to 0 gives the value of leaf node $j$:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$
Thus, for a decision tree with a fixed structure, the optimal objective function is:
$$Obj^{*} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j + \lambda} + \gamma T$$
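The closed-form leaf value and the score of a fixed tree structure can be computed directly from the per-leaf gradient sums. The sketch below uses the standard XGBoost closed forms, w = -G / (H + lambda) per leaf and -1/2 * sum(G^2 / (H + lambda)) + gamma * T for the whole tree; the function names and the numbers in the usage note are illustrative assumptions.

```python
def leaf_weight(G, H, lam):
    """Optimal value of one leaf given its first-order sum G and second-order sum H."""
    return -G / (H + lam)


def structure_score(leaf_stats, lam, gamma):
    """Optimal objective of a fixed tree structure.

    leaf_stats is a list of (G_j, H_j) pairs, one per leaf; lower scores are better.
    """
    fit_term = -0.5 * sum(G * G / (H + lam) for G, H in leaf_stats)
    complexity_term = gamma * len(leaf_stats)
    return fit_term + complexity_term


if __name__ == "__main__":
    stats = [(2.0, 3.0), (-1.5, 2.0)]  # made-up gradient sums for two leaves
    print([leaf_weight(G, H, lam=1.0) for G, H in stats])
    print(structure_score(stats, lam=1.0, gamma=0.5))
```

Comparing `structure_score` for two candidate structures is what allows the tree to be grown greedily: a split is worthwhile only if it lowers the score by more than the added complexity penalty.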
The above derivation assumes that the decision tree structure is fixed. However, the number of possible tree structures is practically unlimited, so it is impossible to enumerate them all in order to find the optimal one. Instead, a greedy strategy is generally used to generate each node of the decision tree. The XGBoost algorithm handles overfitting during tree generation, so a separate pruning stage is not required. The specific steps can be summarized as follows:
1) starting from a tree of depth 0, exhaustively enumerate all available features for each leaf node;
2) for each feature, sort the feature values of the training samples belonging to the node in ascending order, determine the optimal split point of the feature by linear scan, and record the gain of that optimal split point;
3) select the feature with the maximum gain as the splitting feature, use its optimal split point as the split position, generate new left and right leaf nodes for the node, and associate a new sample set with each new node;
4) return to step 1) and continue the recursion until one of the following conditions is satisfied:
a) a split gain threshold min_gain is set, and building of the decision tree stops when the gain obtained after splitting is less than this threshold;
b) a maximum depth threshold max_depth is set, and building of the decision tree stops when the tree reaches this depth;
c) a minimum sample weight sum threshold min_child_leaf is set, and building of the decision tree stops when the number of samples associated with a leaf node obtained by splitting is less than this threshold.
Because a binary split is applied to a node, producing a left child node and a right child node, and the Obj values of all nodes other than the one currently being processed do not change, the gain calculation only needs to consider the Obj value of the current node. The optimal objective function for the node before splitting is:
$$Obj_{before} = -\frac{1}{2}\,\frac{(G_L + G_R)^2}{H_L + H_R + \lambda} + \gamma$$
The optimal objective function after splitting is:
$$Obj_{after} = -\frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}\right] + 2\gamma$$
The gain after splitting is then:
$$Gain = Obj_{before} - Obj_{after} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$
where $G_L$ and $G_R$ are the sums of first-order gradient statistics over the sample sets of the left and right child nodes split from the current node, $H_L$ and $H_R$ are the sums of second-order gradient statistics over the sample sets of the left and right child nodes, $\lambda$ is the L2 regularization coefficient, and $\gamma$ is the regularization coefficient controlling the complexity of the tree.
The above formula can be used to determine the optimal splitting feature and the optimal split point of that feature.
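The linear scan over one feature can be sketched as follows. This is an illustrative sketch under stated assumptions, not the library's implementation: the gain is the standard XGBoost split gain, 1/2 * [G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda) - (G_L+G_R)^2/(H_L+H_R+lambda)] - gamma, candidate thresholds are taken as midpoints between consecutive distinct sorted values, and the names `split_gain` and `best_split` are hypothetical.

```python
def split_gain(GL, HL, GR, HR, lam, gamma):
    """Gain of splitting a node into left and right children (standard XGBoost form)."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma


def best_split(values, grads, hess, lam=1.0, gamma=0.0):
    """Scan one feature in ascending order and return (best_gain, best_threshold)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    G_total, H_total = sum(grads), sum(hess)
    GL = HL = 0.0
    best_gain, best_threshold = float("-inf"), None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        GL += grads[i]  # move sample i from the right child to the left child
        HL += hess[i]
        if values[i] == values[nxt]:
            continue  # cannot split between two equal feature values
        gain = split_gain(GL, HL, G_total - GL, H_total - HL, lam, gamma)
        if gain > best_gain:
            best_gain = gain
            best_threshold = (values[i] + values[nxt]) / 2.0  # assumed midpoint threshold
    return best_gain, best_threshold


if __name__ == "__main__":
    print(best_split([3, 1, 4, 2], [1.0, -1.0, 1.0, -1.0], [1.0, 1.0, 1.0, 1.0]))
```

Running the scan for every feature and keeping the largest gain gives the splitting feature and split point for the node; the stopping conditions (min_gain, max_depth, min_child_leaf) would be checked by the caller that grows the tree.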
Screening the dynamic behavior features of malicious code based on XGBoost requires constructing the importance feature set of each base model. The index most commonly used by XGBoost to measure feature importance is weight, which is the number of times a feature is selected as a splitting feature in the model. The process above yields the optimal splitting feature and optimal split point at each node; the number of times a feature is selected as a splitting feature in the model is then taken as its importance (the more times, the more important the feature). On this basis, the several features with the highest feature importance scores for each base model are screened out to construct its importance feature set.
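Counting splits and keeping the top-ranked features can be sketched in a few lines. The representation below is a simplifying assumption: each tree is given as a plain list of the feature names used at its split nodes, and the feature names are made up. With the xgboost Python package, the same split-count measure is what `Booster.get_score(importance_type="weight")` reports.

```python
from collections import Counter


def split_count_importance(trees):
    """Count how many times each feature is used as a splitting feature across all trees."""
    counts = Counter()
    for split_features in trees:
        counts.update(split_features)
    return counts


def top_k_features(trees, k):
    """Return the k features with the highest split counts (ties broken by name)."""
    counts = split_count_importance(trees)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical split features of three trees in one base model.
    trees = [
        ["api_open", "api_write", "api_open"],
        ["api_connect", "api_open"],
        ["api_write"],
    ]
    print(top_k_features(trees, 2))
```

The resulting list plays the role of the importance feature set of one base model; the sets of different base models are what the correlation test compares.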
S3, performing correlation test on the base model based on the weak correlation integration strategy;
The technical route of the model correlation test is shown in fig. 2. The correlation between base models is judged by analyzing the degree of association between the importance feature sets of different base models. Strongly correlated base models are then screened out, which improves the accuracy of malicious behavior detection.
More specifically, in this step a suitable threshold for the degree of feature association is set (for example 0.8, meaning that the features shared by the importance feature sets of two base models account for 80% of the total number of features). The degree of feature association is calculated for each pair of base models, and if it exceeds the threshold, one of the two base models is removed. This screens out and eliminates the correlation between base models and leaves models with low correlation as the base models for ensemble learning.
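The correlation test might be sketched as follows. Two points are assumptions, because the text does not fix them: the degree of association is computed as shared features divided by all distinct features of the pair (the Jaccard index), and when a pair exceeds the threshold the model that comes later in the input is the one removed. The function and model names are hypothetical.

```python
from typing import Dict, List, Set


def association_degree(a: Set[str], b: Set[str]) -> float:
    """Shared features divided by all distinct features (Jaccard; an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_weakly_correlated(feature_sets: Dict[str, Set[str]],
                             threshold: float = 0.8) -> List[str]:
    """Greedily keep models whose association with every kept model is <= threshold."""
    kept: List[str] = []
    for name, features in feature_sets.items():
        if all(association_degree(features, feature_sets[k]) <= threshold
               for k in kept):
            kept.append(name)
    return kept
```

With the example threshold of 0.8, two base models that share all of their important features would be reduced to one, while a model built on mostly different features would be kept.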
S4, determining the integration weight according to the accuracy of the base model;
The technical process of determining the integration weights of the base models is shown in fig. 3. First, the accuracy of each base model is calculated on the validation set. The weight of a base model in the final integration result is expected to be in direct proportion to its accuracy, that is, the better a base model performs, the larger the weight assigned to it. Finally, the vector formed by the accuracies of the base models is mapped to a weight vector by the softmax function.
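A minimal sketch of the softmax mapping from accuracies to integration weights is given below. The function name `softmax_weights` is hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability step rather than something stated in the text.

```python
import math
from typing import List


def softmax_weights(accuracies: List[float]) -> List[float]:
    """Map validation accuracies to positive weights that sum to 1."""
    peak = max(accuracies)  # subtract the max for numerical stability
    exps = [math.exp(a - peak) for a in accuracies]
    total = sum(exps)
    return [e / total for e in exps]
```

Because accuracies lie between 0 and 1, a plain softmax gives weights that preserve the ranking of the base models but differ only mildly from uniform. Any sharpening of the weights (for example a temperature parameter) would be an extension that the text does not describe.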
And S5, classifying the malicious codes by adopting a Bagging integration strategy based on the integration weight.
(1) Introduction to ensemble learning: the idea of ensemble learning is to generate multiple learners by some rule, combine them with an integration strategy, and output a final result by comprehensive judgment. The learners in ensemble learning are usually homogeneous "weak learners". The outstanding advantage of ensemble learning is that it combines the strengths of multiple base classifiers, so it can achieve higher accuracy, better robustness, better generalization, and stronger parallelism than a single learner. Assuming that each base classifier has error rate ε, that the errors of the base classifiers are mutually independent, and that T is the number of classifiers, the Hoeffding inequality gives the following bound on the total error rate of the integrated classifier:
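$$P(\text{ensemble errs})=\sum_{k=0}^{\lfloor T/2\rfloor}\binom{T}{k}(1-\varepsilon)^{k}\,\varepsilon^{T-k}\le\exp\left(-\frac{1}{2}T(1-2\varepsilon)^{2}\right)$$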
as seen from the right-hand side of the formula, the total error rate falls exponentially as the number of classifiers T grows and tends to 0 when T is sufficiently large, provided that ε is below 0.5.
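The effect of T can be checked numerically with a short sketch that evaluates the exact majority-vote error for independent classifiers and compares it with the Hoeffding bound exp(-T(1-2ε)²/2). The function names are hypothetical, and the independence of the base classifiers is the same idealized assumption made in the text.

```python
import math


def majority_vote_error(eps: float, t: int) -> float:
    """Probability that at most floor(t/2) of t independent classifiers are correct."""
    return sum(math.comb(t, k) * (1 - eps) ** k * eps ** (t - k)
               for k in range(t // 2 + 1))


def hoeffding_bound(eps: float, t: int) -> float:
    """Upper bound exp(-t * (1 - 2*eps)**2 / 2) on that error, for eps < 0.5."""
    return math.exp(-0.5 * t * (1 - 2 * eps) ** 2)
```

For a fixed ε below 0.5, both quantities shrink as T increases, and the exact error stays below the bound.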
(2) Bagging algorithm flow: as shown in fig. 4, several sampling sets, each containing multiple training samples, are drawn from the malicious code training data set by repeated sampling;
a weakly correlated base model obtained from the correlation test step is trained on each sampling set;
the prediction results of the base models are combined, weighted by the determined integration weights, to give the final predicted value.
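The sampling and weighted-combination steps of this flow can be sketched as below. Sampling with replacement (the usual Bagging convention), the helper names, and the representation of each base model's output as a single predicted label are assumptions. Training the base models themselves is omitted.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def bootstrap_sample(n_samples: int, rng: random.Random) -> List[int]:
    """Draw n_samples indices with replacement (one Bagging sampling set)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def weighted_vote(predictions: Sequence[Hashable],
                  weights: Sequence[float]) -> Hashable:
    """Combine one predicted label per base model using the integration weights."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per base model")
    scores: Dict[Hashable, float] = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    # The label with the largest accumulated weight is the final prediction.
    return max(scores, key=scores.get)
```

For instance, if three base models predict the hypothetical family labels "trojan", "worm" and "worm" with weights 0.4, 0.3 and 0.3, the combined prediction is "worm", because its accumulated weight of 0.6 exceeds 0.4.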
Based on the prediction of the integrated model, malicious code can be classified with higher accuracy.
In another embodiment, as shown in fig. 5, a malicious behavior recognition system based on a weak correlation integration strategy is provided, and comprises a base model training module, an importance feature screening module, a correlation checking module and an integration and classification module;
the base model training module randomly extracts several groups of training samples based on the Bagging integration strategy and trains on the extracted samples with XGBoost to obtain multiple base models;
the importance feature screening module screens the dynamic behavior features of malicious code based on XGBoost, screens out the several features with the highest feature importance scores for the base model, and constructs the importance feature set;
the correlation checking module performs a correlation test on the base models based on the weak correlation integration strategy, judges the correlation between base models by analyzing the degree of association between the importance feature sets of different base models, and then screens out and eliminates the correlation between base models to obtain models with low correlation as the base models for ensemble learning;
the integration and classification module determines the integration weights according to the accuracy of the base models and classifies the malicious code using the Bagging integration strategy based on the integration weights. It should be noted that the system of the foregoing embodiment is illustrated only by the above division into functional modules. In practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure may be divided into different functional modules to perform all or part of the functions described above.
As shown in fig. 6, another embodiment of the present application further provides a storage medium storing a program which, when executed by a processor, implements the malicious behavior identification method based on the weak correlation integration strategy, specifically:
S1, randomly extracting several groups of training samples based on the Bagging integration strategy, and training on the extracted samples with XGBoost to obtain multiple base models;
S2, screening the dynamic behavior features of malicious code based on XGBoost, screening out the several features with the highest feature importance scores for the base model, and constructing the importance feature set;
S3, performing a correlation test on the base models based on the weak correlation integration strategy, judging the correlation between base models by analyzing the degree of association between the importance feature sets of different base models, and then screening out and eliminating the correlation between base models to obtain models with low correlation as the base models for ensemble learning;
S4, determining the integration weight according to the accuracy of the base model;
S5, classifying the malicious code using the Bagging integration strategy based on the integration weight.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A malicious behavior identification method based on a weak correlation integration strategy, characterized by comprising the following steps:
randomly extracting several groups of training samples based on a Bagging integration strategy, and training on the extracted samples with XGBoost to obtain multiple base models;
screening the dynamic behavior features of malicious code based on XGBoost, screening out the several features with the highest feature importance scores for the base model, and constructing an importance feature set;
performing a correlation test on the base models based on the weak correlation integration strategy, judging the correlation between base models by analyzing the degree of association between the importance feature sets of different base models, and then screening out and eliminating the correlation between base models to obtain models with low correlation as the base models for ensemble learning;
determining the integration weight of each base model according to its accuracy;
classifying the malicious code using the Bagging integration strategy based on the integration weights.

2. The malicious behavior identification method based on a weak correlation integration strategy according to claim 1, characterized in that screening the dynamic behavior features of malicious code based on XGBoost specifically comprises the following steps:
starting from a tree of depth 0, enumerating all available features for each leaf node;
for each feature, sorting the training samples belonging to the node in ascending order of that feature, determining the best split point of the feature by a linear scan, and obtaining the gain at the best split point;
selecting the feature with the largest gain as the splitting feature, using the best split point of that feature as the split position, splitting the node into two new leaf nodes (left and right), and associating a new sample set with each new node;
returning to the step of enumerating all available features for each leaf node starting from a tree of depth 0, and continuing recursively until: the gain after splitting is less than the set split gain threshold min_gain, the depth of the tree after splitting reaches the set maximum depth threshold max_depth, or the number of samples associated with a leaf node after splitting is less than the minimum sample weight sum threshold min_child_leaf;
taking the number of times a feature is selected as a splitting feature in the model as the index of feature importance, where more selections indicate higher importance, and on this basis screening out the several features with the highest feature importance scores for the base model to construct the importance feature set.

3. The malicious behavior identification method based on a weak correlation integration strategy according to claim 2, characterized in that, for a given node, the optimal objective function before the split is:
$$\mathrm{Obj}_{\text{before}}=-\frac{1}{2}\cdot\frac{(G_L+G_R)^{2}}{H_L+H_R+\lambda}+\gamma$$
where $G_L$ and $G_R$ are the sums of the first-order gradient statistics over the sample sets of the left and right child nodes split from the current node, $H_L$ and $H_R$ are the sums of the second-order gradient statistics over the sample sets of the left and right child nodes, $\lambda$ is the L2 regularization coefficient, and $\gamma$ is the regularization coefficient that controls the complexity of the tree;
and the optimal objective function after the split is:
$$\mathrm{Obj}_{\text{after}}=-\frac{1}{2}\left[\frac{G_L^{2}}{H_L+\lambda}+\frac{G_R^{2}}{H_R+\lambda}\right]+2\gamma$$

4. The malicious behavior identification method based on a weak correlation integration strategy according to claim 3, characterized in that the gain after splitting is:
$$\mathrm{Gain}=\frac{1}{2}\left[\frac{G_L^{2}}{H_L+\lambda}+\frac{G_R^{2}}{H_R+\lambda}-\frac{(G_L+G_R)^{2}}{H_L+H_R+\lambda}\right]-\gamma$$

5. The malicious behavior identification method based on a weak correlation integration strategy according to claim 1, characterized in that performing the correlation test on the base models based on the weak correlation integration strategy is specifically:
setting a suitable threshold for the degree of feature association, calculating the degree of feature association between a pair of base models, and, if the value exceeds the threshold, removing one of the base models, thereby screening out and eliminating the correlation between base models to obtain models with low correlation as the base models for ensemble learning.

6. The malicious behavior identification method based on a weak correlation integration strategy according to claim 1, characterized in that determining the integration weight of each base model according to its accuracy is specifically:
calculating the accuracy of each base model on the validation set, and making the weight in the final integration result proportional to the accuracy, that is, the better a base model performs, the larger the weight assigned to it.

7. The malicious behavior identification method based on a weak correlation integration strategy according to claim 1, characterized in that the vector formed by the accuracies of the base models is mapped to a weight vector by a softmax function.

8. The malicious behavior identification method based on a weak correlation integration strategy according to claim 1, characterized in that classifying the malicious code using the Bagging integration strategy is specifically:
drawing several sampling sets, each containing multiple training samples, from the malicious code training data set by repeated sampling;
training, on each sampling set, a weakly correlated base model obtained from the correlation test step;
combining the prediction results of these base models, weighted by the determined integration weights, to finally determine the predicted value.

9. A malicious behavior identification system based on a weak correlation integration strategy, characterized in that it is applied to the malicious behavior identification method based on a weak correlation integration strategy according to any one of claims 1-8 and comprises a base model training module, an importance feature screening module, a correlation checking module, and an integration and classification module;
the base model training module randomly extracts several groups of training samples based on the Bagging integration strategy and trains on the extracted samples with XGBoost to obtain multiple base models;
the importance feature screening module screens the dynamic behavior features of malicious code based on XGBoost, screens out the several features with the highest feature importance scores for the base model, and constructs the importance feature set;
the correlation checking module performs a correlation test on the base models based on the weak correlation integration strategy, judges the correlation between base models by analyzing the degree of association between the importance feature sets of different base models, and then screens out and eliminates the correlation between base models to obtain models with low correlation as the base models for ensemble learning;
the integration and classification module determines the integration weight of each base model according to its accuracy and classifies the malicious code using the Bagging integration strategy based on the integration weights.

10. A storage medium storing a program, characterized in that, when the program is executed by a processor, the malicious behavior identification method based on a weak correlation integration strategy according to any one of claims 1-8 is implemented.
CN202110590847.XA 2021-05-28 2021-05-28 Malicious behavior identification method, system and medium based on weak correlation integration strategy Active CN113221112B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110590847.XA CN113221112B (en) 2021-05-28 2021-05-28 Malicious behavior identification method, system and medium based on weak correlation integration strategy

Publications (2)

Publication Number Publication Date
CN113221112A CN113221112A (en) 2021-08-06
CN113221112B true CN113221112B (en) 2022-03-04

Family

ID=77099616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110590847.XA Active CN113221112B (en) 2021-05-28 2021-05-28 Malicious behavior identification method, system and medium based on weak correlation integration strategy

Country Status (1)

Country Link
CN (1) CN113221112B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113961922B (en) * 2021-10-27 2023-03-24 浙江网安信创电子技术有限公司 Malicious software behavior detection and classification system based on deep learning
CN114095268A (en) * 2021-11-26 2022-02-25 河北师范大学 Method, terminal and storage medium for network intrusion detection
CN114528946B (en) * 2021-12-16 2022-10-04 浙江省新型互联网交换中心有限责任公司 A Method for Recognition of Sibling Relationships in Autonomous Domain Systems
CN114297924A (en) * 2021-12-27 2022-04-08 杭州迪普科技股份有限公司 Model generation method, device, equipment and computer readable storage medium
CN116155630B (en) * 2023-04-21 2023-07-04 北京邮电大学 Malicious traffic identification method and related equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102034050A (en) * 2011-01-25 2011-04-27 四川大学 Dynamic malicious software detection method based on virtual machine and sensitive Native application programming interface (API) calling perception
CN106919841A (en) * 2017-03-10 2017-07-04 西京学院 A kind of efficient Android malware detection model DroidDet based on rotation forest
CN107872457A (en) * 2017-11-09 2018-04-03 北京明朝万达科技股份有限公司 A kind of method and system that network operation is carried out based on predicting network flow
CN110414234A (en) * 2019-06-28 2019-11-05 奇安信科技集团股份有限公司 Malicious code family identification method and device
CN111797394A (en) * 2020-06-24 2020-10-20 广州大学 APT organization identification method, system and storage medium based on stacking integration
CN112396188A (en) * 2020-11-19 2021-02-23 深延科技(北京)有限公司 Automatic machine learning and training method, device and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11481492B2 (en) * 2017-07-25 2022-10-25 Trend Micro Incorporated Method and system for static behavior-predictive malware detection
CN110135159A (en) * 2019-04-18 2019-08-16 上海交通大学 Malicious code shell identification and static unpacking method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Novel Solutions for Malicious Code Detection and Family Clustering Based on Machine Learning;Hangfeng Yang 等;《Security and Privacy in Emerging Decentralized Communication Environments》;20191010;第148853-148860页 *
Blackmailer or Consumer? A Character-level CNN Approach for Identifying Malicious Complaint Behaviors;Zipei Li;《2020 International Conference on Computing, Networking and Communications (ICNC)》;20200330;第41-45页 *
基于在线集成学习的入侵检测方法研究;李宇雄;《中国优秀硕士学位论文全文数据库 信息科技辑》;20200215(第02期);第I139-91页 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant