CN113221112A - Malicious behavior identification method, system and medium based on weak correlation integration strategy - Google Patents

Malicious behavior identification method, system and medium based on weak correlation integration strategy

Info

Publication number
CN113221112A
CN113221112A
Authority
CN
China
Prior art keywords
correlation
integration
feature
base
malicious
Prior art date
Legal status
Granted
Application number
CN202110590847.XA
Other languages
Chinese (zh)
Other versions
CN113221112B (en)
Inventor
李树栋
厉源
吴晓波
韩伟红
方滨兴
田志宏
顾钊铨
殷丽华
杨航锋
Current Assignee
Guangzhou University
Original Assignee
Guangzhou University
Priority date
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN202110590847.XA priority Critical patent/CN113221112B/en
Publication of CN113221112A publication Critical patent/CN113221112A/en
Application granted granted Critical
Publication of CN113221112B publication Critical patent/CN113221112B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning

Abstract

The invention discloses a malicious behavior identification method, system and medium based on a weak correlation integration strategy. The method comprises the steps of constructing base models from a sample set; screening the dynamic behavior features of malicious code based on XGboost; performing a correlation test on the base models based on the weak correlation integration strategy; determining the integration weight of each base model according to its accuracy; and classifying malicious code based on the Bagging integration strategy. By adopting the XGboost algorithm to determine the number of ensemble-learning base models used in malicious code identification, the invention eases the problem of base-model selection in ensemble learning and also improves the accuracy of malicious code identification. In addition, the invention adopts a weak correlation integration strategy for the ensemble-learning base models, which weakens the inter-model correlation that commonly arises when an integration strategy is applied to the malicious code classification task, and it constructs an accuracy-guided scheme for determining the weight of each single model, so that the malicious code identification task is completed efficiently and accurately.

Description

Malicious behavior identification method, system and medium based on weak correlation integration strategy
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a malicious behavior identification method, a malicious behavior identification system and a malicious behavior identification medium based on a weak correlation integration strategy.
Background
With continued economic growth and great progress in communication technology, the Internet has become an extremely important part of people's work and life. It has not only greatly promoted the development of human society but has also made connections around the world ever tighter. In China in particular, the Internet is growing ever faster, and the related technologies are becoming increasingly mature. The 44th Statistical Report on Internet Development in China indicates that by June 2019 the number of Internet users in China had reached 854 million, an increase of 25.98 million over the end of 2018, and the Internet penetration rate had reached 81.2%, an improvement of 1.6 percentage points over the end of 2018.
In the Internet era, information security protection cannot be delayed. The flood of malicious code poses a very serious threat to Internet information security. In 2019, 360 Mobile Phone Guard intercepted about 2.28 billion phishing-website attacks and about 950 million malicious-program attacks against users nationwide; it also intercepted about 26.09 billion harassing calls and about 9.53 billion spam text messages. According to statistics from the 360 Security Brain, about 4.125 million computers were found to have been attacked by ransomware in the first 11 months of 2019, and nearly 4,600 ransomware complaint cases were handled. In terms of both attack activity and degree of harm, ransomware attacks remain one of the biggest security threats facing computers in China.
Faced with such massive and varied malware attacks, tens of thousands of code-detection tasks cannot be completed by manual analysis alone, so many network security researchers are striving to find more effective defense methods and detection means. Big data technology provides powerful support for this goal, and more and more network security researchers apply machine learning algorithms to malicious code classification in order to detect malicious code automatically. In industry, security vendors have now begun to deploy big-data-based malware detection systems that combine machine learning techniques with domain expert knowledge.
(1) Feature extraction
Feature extraction for malware has always been an important problem. Ravi and Manoharan built a dynamic malware monitoring system that extracts 4-gram statistical features of the Windows API call sequence at run time, generates classification rules with an association mining algorithm, and constructs a rule base for software classification. Abou-Assaleh, Cercone et al. proposed an N-gram-based malware detection model that intercepts N-gram sequences of code bytes as features and uses their occurrence frequency in a software similarity measure to classify malware. Yang Liang proposed a malicious program detection model in which the cleaned API sequence is first fed into a Word2vec model, the resulting word vectors are arranged in order into a matrix serving as the input of a convolutional neural network, and a deep convolutional neural network then extracts further features and performs classification. Ye, Chen, Hou et al. designed an SAE deep neural network based on the Windows API calls of software, performing feature learning with an autoencoder and then feeding the result into a parameter-tuned classification model to detect malware. Other work applies one-hot encoding to n-gram sequences of Windows API calls as the input feature data of a convolutional neural network and extracts hidden features of the software samples through convolution and pooling to perform classification.
(2) Malware analysis and detection
Traditional efforts have focused mainly on static and dynamic analysis, but the rapid growth and evolution of malware forces researchers to pursue new analysis and detection solutions, and machine learning is one of the innovative technologies applied in this direction. Xu Lin et al. studied and analyzed the character features of the domain names generated by the DGA algorithms used in botnets: domain names with invalid DNS resolution are grouped by a clustering algorithm, IP addresses mapped to more than a certain number of malicious domain names are screened out according to the mapping between single malicious domain names and multiple IP addresses, the screened IP addresses and the NXDomains are combined into a matrix, and bipartite-graph clustering with dimensionality reduction is applied again to search for possibly infected bot hosts. Ravi and Manoharan proposed a dynamic malicious code detection system based on frequent itemsets of Windows API call sequences together with naive Bayes, support vector machine and decision tree classifiers. Ding, Chen et al. proposed a detection model based on malicious code genes, taking the specific key behavior fragments contained in basic code instructions as code features, and Shifu Hou, Lifei Chen et al. proposed a malicious code classification and ensemble model based on k-means clustering and support vector machines. Much work has focused on building analysis frameworks, obtaining static features, and classifying malware families, and experiments show that text classification methods are effective in improving the detection accuracy of obfuscated samples. As for comparisons of machine learning algorithms, naive Bayes, random forest and support vector machine (SVM) classifiers have, for example, been applied to the problem of detecting malicious Application Programming Interface (API) call sequences, and Kolter, replacing byte sequences with n-grams, compared the performance of naive Bayes, decision trees and SVMs in malware detection. In data-mining and clustering approaches to malware detection, Schultz first proposed using three different types of static features: the PE header, string sequences and byte sequences. Among approaches that explore and exploit the visual features of samples, most studies consider that malware can be clustered by family or similarity. Artificial neural networks have subsequently also been used for malware detection, together with some newer ideas such as detecting malware with image-processing techniques.
The prior art has the following defects:
(1) industrial analysis techniques rely mainly on manual analysis by security experts, are strongly influenced by expert experience, cannot cope with large numbers of samples, and are inefficient and time-consuming;
(2) the static API function features relied on by academic automated recognition techniques are difficult to extract when malware employs obfuscation and packing techniques;
(3) current methods rely primarily on known malicious code samples; if variants are identified solely on the basis of existing samples, the identification work may become inefficient or even ineffective.
Disclosure of Invention
The invention mainly aims to overcome the defects of low efficiency, difficult static feature extraction and the like in the existing malicious code identification technology, and provides a malicious behavior identification method, a malicious behavior identification system and a malicious behavior identification medium based on a weak correlation integration strategy.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a malicious behavior identification method based on a weak correlation integration strategy, which comprises the following steps:
randomly extracting a plurality of groups of training samples based on a Bagging integration strategy, and training the extracted samples based on XGboost to obtain a plurality of base models;
screening the dynamic behavior characteristics of the malicious codes based on XGboost, screening out a plurality of characteristics with the highest characteristic importance scores for the base model, and constructing an importance characteristic set;
performing correlation test on the base models based on a weak correlation integration strategy, judging the correlation between the base models by analyzing the correlation degree of an importance feature set between different base models, and further screening and eliminating the correlation between the base models to obtain a model with low correlation as a base model for integrated learning;
determining the integration weight of the base model according to the accuracy of the base model;
and classifying the malicious codes by adopting a Bagging integration strategy based on the integration weight.
As a preferred technical solution, the screening of the dynamic behavior features of the malicious code based on XGBoost specifically comprises the following steps:
exhaustively enumerating all available features for each leaf node starting with a tree of depth 0;
for each feature, the feature values of the training samples belonging to the node are arranged in ascending order, the optimal split point of the feature is determined by a linear scan, and the gain of the optimal split point is recorded;
selecting the feature with the maximum gain as the split feature, using the optimal split point of the feature as the split position, generating two new left and right leaf nodes for the node, and associating a new sample set with each new node;
back to the step of exhaustively enumerating all available features for each leaf node starting with a tree of depth 0, and continuing the recursive operation until: the gain after splitting is smaller than the set split gain threshold min_gain, the tree reaches the set maximum depth threshold max_depth, or the number of samples associated with a leaf node after splitting is smaller than the set minimum sample weight sum threshold min_child_leaf;
the number of times a feature is selected as the split feature in the model is used as the index of feature importance; the more times, the more important the feature. On this basis, the several features with the highest feature importance scores are screened out for the base model, and the importance feature set is constructed.
As a preferred technical solution, for a certain node, the optimal objective function before splitting is as follows:

Obj_{before} = -\frac{1}{2}\,\frac{(G_L + G_R)^2}{H_L + H_R + \lambda} + \gamma

wherein G_L and G_R are respectively the first-order gradient sums over the sample sets of the left and right child nodes produced by splitting the current node, H_L and H_R are respectively the corresponding second-order gradient sums, λ is the L2 regularization term coefficient, and γ is the regularization term coefficient controlling the complexity of the tree,

the optimal objective function after splitting is as follows:

Obj_{after} = -\frac{1}{2}\left(\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}\right) + 2\gamma

As a preferred technical solution, the gain after splitting is:

Gain = \frac{1}{2}\left(\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right) - \gamma
as a preferred technical solution, the correlation test of the base model based on the weak correlation integration strategy specifically includes:
setting a proper feature association degree threshold, calculating the feature association degree between a pair of base models, and if the feature association degree value exceeds the threshold, rejecting one base model to screen and eliminate the correlation between the base models to obtain a model with low correlation as the base model for ensemble learning.
As a preferred technical solution, the determining the integration weight according to the accuracy of the base model specifically includes:
and calculating the accuracy index of the base model in the verification set, wherein the weight in the final integration result is in direct proportion to the accuracy, namely the more obvious the effect of the base model is, the more weight is distributed to the base model.
As a preferred technical solution, vectors composed of accuracy rates of the respective basis models are mapped to weight value vectors by a softmax function.
As a preferred technical solution, the classifying of malicious codes by using a Bagging integration policy specifically includes:
sampling a plurality of sampling sets containing a plurality of training samples from a malicious code training data set for a plurality of times;
training a weak correlation base model obtained by the correlation test step based on each sampling set;
and carrying out weighted combination on the prediction results of the base models according to the determined integration weight, and finally determining a predicted value.
In another aspect of the present invention, a system for identifying malicious behaviors based on a weak correlation integration policy is further provided, and the system is applied to the method for identifying malicious behaviors based on a weak correlation integration policy, and includes a base model training module, an importance feature screening module, a correlation checking module, and an integration and classification module:
the base model training module randomly extracts a plurality of groups of training samples based on a Bagging integration strategy, and obtains a plurality of base models by training XGboost models on the extracted samples;
the importance feature screening module screens the dynamic behavior features of the malicious codes based on XGboost, screens out a plurality of features with the highest feature importance scores for the base model, and constructs an importance feature set;
the correlation testing module is used for carrying out correlation testing on the base models based on a weak correlation integration strategy, judging the correlation among the base models by analyzing the correlation degree of the importance feature sets among different base models, and further screening and eliminating the correlation among the base models to obtain a model with low correlation as the base model for integrated learning;
and the integration and classification module determines the integration weight according to the accuracy of the base model, and classifies the malicious codes by adopting a Bagging integration strategy based on the integration weight.
In still another aspect of the present invention, a storage medium is further provided, which stores a program, and when the program is executed by a processor, the program implements the above malicious behavior identification method based on the weak correlation integration policy.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) according to the invention, the XGboost algorithm is adopted to determine the number of the integrated learning base models in the malicious code identification, so that the problem of selection of the base models in the integrated learning is reduced, and the accuracy of the malicious code identification is improved.
(2) The invention adopts the weak correlation integration strategy of the integrated learning base model, weakens the correlation problem between the base models which commonly exist when the integration strategy is used for solving the malicious code classification task, constructs the single model weight determination model which takes the accuracy as the guide, and completes the efficient and accurate malicious code identification task.
(3) According to the invention, the Bagging strategy of integrated learning is adopted in the classification of malicious codes, and the prediction error of the model is reduced by combining with the single model weight determination model based on the accuracy as the guide, so that the accuracy and the stability of the malicious code identification task are effectively improved.
Drawings
FIG. 1 is a flowchart of a malicious behavior identification method based on a weakly-associated integrated policy according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a technical route for correlation testing of a base model based on a weak correlation integration strategy according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a technical route for determining integration weights of a base model according to the accuracy of the base model according to an embodiment of the present invention;
FIG. 4 is a flowchart of classifying malicious codes by using a Bagging integration policy according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a malicious behavior identification system based on a weakly-correlated integrated policy according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a storage medium according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Examples
As shown in fig. 1, the present embodiment provides a malicious behavior identification method based on a weak correlation integration policy, including the following steps:
s1, randomly extracting a plurality of groups of training samples based on a Bagging integration strategy, and training the extracted samples based on XGboost to obtain a plurality of base models;
s2, screening the dynamic behavior characteristics of the malicious codes based on XGboost; screening out a plurality of features with the highest feature importance scores for the base model, and constructing an importance feature set;
(1) Introduction of the model: the XGboost algorithm is an improved version of the GBDT algorithm, and its objective function at round k is:

Obj^{(k)} = \sum_i l\big(y^{(i)},\ \hat{y}^{(i)}_{k-1} + f_k(x^{(i)})\big) + \Omega(f_k) + const

To simplify the loss, l(y^{(i)}, \cdot) is expanded to second order around \hat{y}^{(i)}_{k-1}:

l\big(y^{(i)},\ \hat{y}^{(i)}_{k-1} + f_k(x^{(i)})\big) \approx l\big(y^{(i)},\ \hat{y}^{(i)}_{k-1}\big) + g_i f_k(x^{(i)}) + \frac{1}{2} h_i f_k^2(x^{(i)})

where the first-order derivative \partial_{\hat{y}_{k-1}} l(y^{(i)}, \hat{y}^{(i)}_{k-1}) is recorded as g_i and the second-order derivative \partial^2_{\hat{y}_{k-1}} l(y^{(i)}, \hat{y}^{(i)}_{k-1}) is recorded as h_i. Because \hat{y}^{(i)}_{k-1} is already known from the previous step, l(y^{(i)}, \hat{y}^{(i)}_{k-1}) is a constant and has no influence on optimizing the objective function, so substituting the expansion into the objective function Obj^{(k)} gives:

Obj^{(k)} \approx \sum_i \Big[ g_i f_k(x^{(i)}) + \frac{1}{2} h_i f_k^2(x^{(i)}) \Big] + \Omega(f_k)
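As a concrete illustration (not part of the patent text), the sketch below computes g_i and h_i for the binary logistic loss, one common choice of l(y, ŷ) in the expansion above; the function name and sample values are hypothetical.

```python
import numpy as np

def logistic_grad_hess(y, margin):
    """g_i and h_i of the binary logistic loss, taken w.r.t. the raw margin (y_hat before the sigmoid)."""
    p = 1.0 / (1.0 + np.exp(-margin))  # predicted probability sigma(y_hat)
    g = p - y                          # first-order derivative  dl/dy_hat
    h = p * (1.0 - p)                  # second-order derivative d2l/dy_hat2
    return g, h

y = np.array([1.0, 0.0, 1.0, 1.0])          # labels
margin = np.array([0.3, -0.1, 1.2, -0.5])   # previous-round predictions y_hat_{k-1}
g, h = logistic_grad_hess(y, margin)
print(g, h)
```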
(2) Optimizing the objective function: taking the objective function of the XGboost algorithm, for any decision tree f_k, assuming the number of leaf nodes is T, the tree can be described by a vector w \in \mathbb{R}^T composed of the values attached to its leaf nodes together with a function q(\cdot) that maps a feature vector to a leaf node index, so that each sample falls on exactly one leaf node. The decision tree f_k can therefore be defined as f_k(x) = w_{q(x)}. The complexity of the decision tree is controlled by the regularization term

\Omega(f_k) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^2

which, by definition, measures the complexity of the decision tree model by the number of leaf nodes T and the L2 norm of the leaf-value vector w. Defining I_j = \{ i \mid q(x^{(i)}) = j \} as the set of all training samples assigned to leaf node j, the objective, previously a sum over training samples, can be rewritten as a sum over leaf nodes, so the objective function of the XGBoost algorithm becomes:

Obj^{(k)} = \sum_{j=1}^{T}\Big[\Big(\sum_{i \in I_j} g_i\Big) w_j + \frac{1}{2}\Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2\Big] + \gamma T

Letting G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i, this becomes:

Obj^{(k)} = \sum_{j=1}^{T}\Big[G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2\Big] + \gamma T

When the tree is updated at step k with a fixed tree structure, the samples of each leaf node are known, so q(\cdot) and the sets I_j are known; g_i and h_i are derivatives taken at step k-1 and are therefore also known, and hence G_j and H_j are known. Setting the first derivative of the objective function Obj^{(k)} with respect to w_j to 0 gives the optimal value of leaf node j:

w_j^{*} = -\frac{G_j}{H_j + \lambda}

Thus, for a decision tree of fixed structure, the optimal objective function Obj is:

Obj^{*} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j + \lambda} + \gamma T
The above derivation assumes a fixed decision tree structure; however, the number of possible tree structures is essentially unlimited, so it is impractical to enumerate all of them in search of the optimal structure. In general, a greedy strategy is used to generate each node of the decision tree, and the XGboost algorithm handles the overfitting problem during the tree-generation stage, so no separate pruning stage is required. The specific steps can be summarized as follows:
1) exhaustively enumerating all available features for each leaf node starting with a tree of depth 0;
2) for each feature, the feature values of the training samples belonging to the node are arranged in ascending order, the optimal split point of the feature is determined by a linear scan, and the gain of the optimal split point is recorded;
3) selecting the feature with the maximum gain as the split feature, using the optimal split point of the feature as the split position, generating two new left and right leaf nodes for the node, and associating a new sample set with each new node;
4) going back to the first step, the recursive operation is continued until one of the following conditions is satisfied:
a) setting a split gain threshold min_gain, and stopping building the decision tree when the gain obtained after splitting is less than the set threshold;
b) setting a maximum depth threshold max_depth, and stopping building the decision tree when the tree reaches the maximum depth;
c) setting the minimum sample weight sum threshold min_child_leaf, and stopping building the decision tree when the number of samples associated with a leaf node obtained by splitting is less than the set threshold.
Because a binary split is applied at a given node, only that node's left and right children are affected; the Obj values of all other nodes remain unchanged, so computing the gain only requires the Obj contribution of the current node. The optimal objective function for the node before splitting is:

Obj_{before} = -\frac{1}{2}\,\frac{(G_L + G_R)^2}{H_L + H_R + \lambda} + \gamma

The optimal objective function after splitting is:

Obj_{after} = -\frac{1}{2}\left(\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}\right) + 2\gamma

For this objective function, the gain of the split is then:

Gain = \frac{1}{2}\left(\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right) - \gamma

where G_L and G_R are respectively the first-order gradient sums over the sample sets of the left and right child nodes produced by splitting the current node, H_L and H_R are respectively the corresponding second-order gradient sums, λ is the L2 regularization term coefficient, and γ is the regularization term coefficient controlling the complexity of the tree.
The formula above can be used to determine the best split feature and the optimal split point of that feature.
The dynamic behavior features of malicious code are screened based on XGboost as the basis for constructing the importance feature set of each base model. The most common index XGboost uses to measure feature importance is the weight, which represents the number of times a feature is selected as a split feature in the model. The process above generates the optimal split points and split features; the number of times a feature is selected as a split feature in the model is therefore used as the index of feature importance (the more times, the more important the feature), and on this basis the several features with the highest feature importance scores are screened out for the base model to construct the importance feature set.
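The sketch below shows one way to obtain such an importance feature set with the xgboost Python package, whose "weight" importance type counts how many times each feature is used as a split feature; the hyperparameters and the set size k are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np
from xgboost import XGBClassifier

def importance_feature_set(X, y, k=20):
    """Train one base model and return it together with its top-k split-count features."""
    model = XGBClassifier(n_estimators=100, max_depth=6)
    model.fit(X, y)
    # 'weight' = number of times a feature appears as a split feature across all trees
    scores = model.get_booster().get_score(importance_type="weight")
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return model, {name for name, _ in top}
```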
S3, performing correlation test on the base model based on the weak correlation integration strategy;
the technical route of the model correlation test is shown in fig. 2, and the correlation between the base models can be judged by analyzing the correlation degree of the importance feature sets between different base models, so that the correlation between the base models can be screened and eliminated, and the accuracy of malicious behavior detection can be improved.
More specifically, in this step a suitable feature association degree threshold is set (for example 0.8, meaning that the features shared by the importance feature sets of two base models account for 80% of all their features); the feature association degree is computed for each pair of base models, and if it exceeds the threshold, one of the two base models is removed. In this way the correlation between base models is screened out and the remaining low-correlation models are used as the base models for ensemble learning.
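A minimal sketch of this screening step follows (illustrative Python). The association degree is computed here as the share of features common to two importance feature sets relative to their union; the exact denominator intended by the text ("the total number of all the features") may differ, so treat this as one plausible reading rather than the patent's definitive formula.

```python
def association_degree(set_a, set_b):
    """Feature association degree of two importance feature sets (shared features / union)."""
    return len(set_a & set_b) / len(set_a | set_b)

def select_weakly_correlated(models, feature_sets, threshold=0.8):
    """Keep a base model only if its association degree with every already-kept model stays below the threshold."""
    kept_models, kept_sets = [], []
    for model, fset in zip(models, feature_sets):
        if all(association_degree(fset, s) <= threshold for s in kept_sets):
            kept_models.append(model)
            kept_sets.append(fset)
    return kept_models, kept_sets
```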
S4, determining the integration weight according to the accuracy of the base model;
The technical process of determining the integration weights of the base models is shown in fig. 3. First, the accuracy of each base model is computed on the validation set; the weight in the final integration result is intended to be proportional to this accuracy, i.e. the better a base model performs, the larger the weight assigned to it. Finally, the vector formed by the accuracies of the base models is mapped to a weight value vector through the softmax function.
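A minimal sketch of this weighting step (illustrative Python; the function name is assumed):

```python
import numpy as np

def accuracy_to_weights(accuracies):
    """Map validation accuracies of the retained base models to integration weights via softmax."""
    a = np.asarray(accuracies, dtype=float)
    e = np.exp(a - a.max())   # subtract the maximum for numerical stability
    return e / e.sum()

print(accuracy_to_weights([0.91, 0.88, 0.94]))  # the most accurate model gets the largest weight
```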
And S5, classifying the malicious codes by adopting a Bagging integration strategy based on the integration weight.
(1) Introduction to ensemble learning: the idea of ensemble learning is to generate a number of learners according to some rule, combine them with an integration strategy, and output a final result through comprehensive judgment. Most of the learners in ensemble learning are homogeneous "weak learners". The outstanding advantages of ensemble learning are that it combines the strengths of multiple base classifiers, can achieve higher accuracy than a single machine learning algorithm, and has better robustness, generalization ability and parallelism. Assuming that the error rates ε of the base classifiers are mutually independent and T is the number of classifiers combined by voting, the Hoeffding inequality gives the total error rate of the integrated classifier as:

P(H(x) \neq f(x)) = \sum_{k=0}^{\lfloor T/2 \rfloor} \binom{T}{k} (1-\varepsilon)^{k} \varepsilon^{T-k} \leq \exp\!\left(-\frac{1}{2} T (1-2\varepsilon)^{2}\right)

As the right-hand side of the formula shows, when the number of integrated classifiers T is sufficiently large, the total error rate tends to 0.
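The following sketch (illustrative, with arbitrary example values) evaluates the right-hand side of the bound numerically and shows it shrinking as T grows:

```python
import math

eps = 0.4                      # assumed error rate of each independent base classifier
for T in (11, 51, 101):        # number of base classifiers
    bound = math.exp(-T * (1 - 2 * eps) ** 2 / 2)
    print(T, round(bound, 4))  # 0.8025, 0.3606, 0.1327 -> tends to 0 as T grows
```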
(2) Bagging algorithm flow: as shown in fig. 4, a plurality of sampling sets including a plurality of training samples are sampled from the malicious code training data set for a plurality of times;
training a weak correlation base model obtained by the correlation test step based on each sampling set;
and carrying out weighted combination on the prediction results of the base models according to the determined integration weight, and finally determining a predicted value.
According to the integrated model prediction result, a malicious code classification effect with higher accuracy can be obtained.
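A minimal sketch of this weighted combination (illustrative Python, assuming scikit-learn-style base models with predict_proba and the weight vector obtained in the previous step):

```python
import numpy as np

def ensemble_predict(models, weights, X):
    """Weighted average of the base models' class-probability outputs, then argmax per sample."""
    proba = sum(w * m.predict_proba(X) for m, w in zip(models, weights))
    return np.argmax(proba, axis=1)
```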
In another embodiment, as shown in fig. 5, a malicious behavior recognition system based on a weak correlation integration strategy is provided, and comprises a base model training module, an importance feature screening module, a correlation checking module and an integration and classification module;
the base model training module randomly extracts a plurality of groups of training samples based on a Bagging integration strategy, and obtains a plurality of base models by training XGboost models on the extracted samples;
the importance feature screening module screens the dynamic behavior features of the malicious codes based on XGboost, screens out a plurality of features with the highest feature importance scores for the base model, and constructs an importance feature set;
the correlation testing module is used for carrying out correlation testing on the base models based on a weak correlation integration strategy, judging the correlation among the base models by analyzing the correlation degree of the importance feature sets among different base models, and further screening and eliminating the correlation among the base models to obtain a model with low correlation as the base model for integrated learning;
and the integration and classification module determines the integration weight according to the accuracy of the base model, and classifies the malicious codes by adopting a Bagging integration strategy based on the integration weight. It should be noted that the system provided in the foregoing embodiment is only illustrated by the division of the functional modules, and in practical applications, the function allocation may be completed by different functional modules according to needs, that is, the internal structure is divided into different functional modules to complete all or part of the functions described above.
As shown in fig. 6, in another embodiment of the present application, a storage medium is further provided, where a program is stored, and when the program is executed by a processor, the method for identifying malicious behaviors based on a weak correlation integration policy is implemented, specifically:
s1, randomly extracting a plurality of groups of training samples based on a Bagging integration strategy, and training the extracted samples based on XGboost to obtain a plurality of base models;
s2, screening the dynamic behavior characteristics of the malicious codes based on XGboost, screening a plurality of characteristics with the highest characteristic importance scores for the base model, and constructing an importance characteristic set;
s3, performing correlation test on the base models based on a weak correlation integration strategy, judging the correlation between the base models by analyzing the correlation degree of the importance feature sets between different base models, and further screening and eliminating the correlation between the base models to obtain a model with low correlation as the base model for integrated learning;
s4, determining the integration weight according to the accuracy of the base model;
and S5, classifying the malicious codes by adopting a Bagging integration strategy based on the integration weight.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. The malicious behavior identification method based on the weak correlation integration strategy is characterized by comprising the following steps of:
randomly extracting a plurality of groups of training samples based on a Bagging integration strategy, and training the extracted samples based on XGboost to obtain a plurality of base models;
screening the dynamic behavior characteristics of the malicious codes based on XGboost, screening out a plurality of characteristics with the highest characteristic importance scores for the base model, and constructing an importance characteristic set;
performing correlation test on the base models based on a weak correlation integration strategy, judging the correlation between the base models by analyzing the correlation degree of an importance feature set between different base models, and further screening and eliminating the correlation between the base models to obtain a model with low correlation as a base model for integrated learning;
determining the integration weight of the base model according to the accuracy of the base model;
and classifying the malicious codes by adopting a Bagging integration strategy based on the integration weight.
2. The method for identifying malicious behaviors based on weak correlation integration policy according to claim 1, wherein the screening of the dynamic behavior characteristics of the malicious code based on the XGBoost specifically comprises the following steps:
exhaustively enumerating all available features for each leaf node starting with a tree of depth 0;
for each feature, the feature values of the training samples belonging to the node are arranged in ascending order, the optimal split point of the feature is determined by a linear scan, and the gain of the optimal split point is recorded;
selecting the feature with the maximum gain as the split feature, using the optimal split point of the feature as the split position, generating two new left and right leaf nodes for the node, and associating a new sample set with each new node;
back to the step of exhaustively enumerating all available features for each leaf node starting with a tree of depth 0, and continuing the recursive operation until: the gain after splitting is smaller than the set split gain threshold min_gain, the tree reaches the set maximum depth threshold max_depth, or the number of samples associated with a leaf node after splitting is smaller than the set minimum sample weight sum threshold min_child_leaf;
the number of times a feature is selected as the split feature in the model is used as the index of feature importance; the more times, the more important the feature. On this basis, the several features with the highest feature importance scores are screened out for the base model, and the importance feature set is constructed.
3. The method for identifying malicious behaviors based on weak correlation integration policy according to claim 2, wherein for a certain node, the optimal objective function before splitting is as follows:

Obj_{before} = -\frac{1}{2}\,\frac{(G_L + G_R)^2}{H_L + H_R + \lambda} + \gamma

wherein G_L and G_R are respectively the first-order gradient sums over the sample sets of the left and right child nodes split from the current node, H_L and H_R are respectively the corresponding second-order gradient sums, λ is the L2 regularization term coefficient, and γ is the regularization term coefficient controlling the complexity of the tree,

the optimal objective function after splitting is as follows:

Obj_{after} = -\frac{1}{2}\left(\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}\right) + 2\gamma

4. The malicious behavior identification method based on the weak correlation integration strategy as claimed in claim 3, wherein the gain after splitting is:

Gain = \frac{1}{2}\left(\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right) - \gamma
5. the method for identifying malicious behavior based on weak correlation integration policy according to claim 1, wherein the correlation test of the base model based on the weak correlation integration policy specifically comprises:
setting a proper feature association degree threshold, calculating the feature association degree between a pair of base models, and if the feature association degree value exceeds the threshold, rejecting one base model to screen and eliminate the correlation between the base models to obtain a model with low correlation as the base model for ensemble learning.
6. The malicious behavior identification method based on the weak correlation integration policy according to claim 1, wherein the determining of the integration weight according to the accuracy of the base model specifically comprises:
and calculating the accuracy index of the base model in the verification set, wherein the weight in the final integration result is in direct proportion to the accuracy, namely the more obvious the effect of the base model is, the more weight is distributed to the base model.
7. The malicious behavior identification method based on the weak correlation integration policy according to claim 1, wherein a vector formed by the accuracy rates of the base models is mapped to a weight value vector through a softmax function.
8. The method for identifying malicious behaviors based on a weak correlation integration policy according to claim 1, wherein the classifying malicious codes by using a Bagging integration policy specifically comprises:
sampling a plurality of sampling sets containing a plurality of training samples from a malicious code training data set for a plurality of times;
training a weak correlation base model obtained by the correlation test step based on each sampling set;
and carrying out weighted combination on the prediction results of the base models according to the determined integration weight, and finally determining a predicted value.
9. The system for identifying the malicious behavior based on the weak correlation integration strategy is characterized by being applied to the method for identifying the malicious behavior based on the weak correlation integration strategy, which is disclosed by any one of claims 1 to 8, and comprising a base model training module, an importance feature screening module, a correlation checking module and an integrating and classifying module;
the base model training module randomly extracts a plurality of groups of training samples based on a Bagging integration strategy, and obtains a plurality of base models by training XGboost models on the extracted samples;
the importance feature screening module screens the dynamic behavior features of the malicious codes based on XGboost, screens out a plurality of features with the highest feature importance scores for the base model, and constructs an importance feature set;
the correlation testing module is used for carrying out correlation testing on the base models based on a weak correlation integration strategy, judging the correlation among the base models by analyzing the correlation degree of the importance feature sets among different base models, and further screening and eliminating the correlation among the base models to obtain a model with low correlation as the base model for integrated learning;
and the integration and classification module determines the integration weight according to the accuracy of the base model, and classifies the malicious codes by adopting a Bagging integration strategy based on the integration weight.
10. A storage medium storing a program, characterized in that: the program, when executed by a processor, implements the method for malicious behavior identification based on weakly correlated integration policy of any of claims 1-8.
CN202110590847.XA 2021-05-28 2021-05-28 Malicious behavior identification method, system and medium based on weak correlation integration strategy Active CN113221112B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110590847.XA CN113221112B (en) 2021-05-28 2021-05-28 Malicious behavior identification method, system and medium based on weak correlation integration strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110590847.XA CN113221112B (en) 2021-05-28 2021-05-28 Malicious behavior identification method, system and medium based on weak correlation integration strategy

Publications (2)

Publication Number Publication Date
CN113221112A true CN113221112A (en) 2021-08-06
CN113221112B CN113221112B (en) 2022-03-04

Family

ID=77099616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110590847.XA Active CN113221112B (en) 2021-05-28 2021-05-28 Malicious behavior identification method, system and medium based on weak correlation integration strategy

Country Status (1)

Country Link
CN (1) CN113221112B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113961922A (en) * 2021-10-27 2022-01-21 浙江网安信创电子技术有限公司 Malicious software behavior detection and classification system based on deep learning
CN114095268A (en) * 2021-11-26 2022-02-25 河北师范大学 Method, terminal and storage medium for network intrusion detection
CN114528946A (en) * 2021-12-16 2022-05-24 浙江省新型互联网交换中心有限责任公司 Autonomous domain system sibling relation recognition method
CN116155630A (en) * 2023-04-21 2023-05-23 北京邮电大学 Malicious traffic identification method and related equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102034050A (en) * 2011-01-25 2011-04-27 四川大学 Dynamic malicious software detection method based on virtual machine and sensitive Native application programming interface (API) calling perception
CN106919841A (en) * 2017-03-10 2017-07-04 西京学院 A kind of efficient Android malware detection model DroidDet based on rotation forest
CN107872457A (en) * 2017-11-09 2018-04-03 北京明朝万达科技股份有限公司 A kind of method and system that network operation is carried out based on predicting network flow
US20190034632A1 (en) * 2017-07-25 2019-01-31 Trend Micro Incorporated Method and system for static behavior-predictive malware detection
CN110135159A (en) * 2019-04-18 2019-08-16 上海交通大学 The identification of malicious code shell and static hulling method and system
CN110414234A (en) * 2019-06-28 2019-11-05 奇安信科技集团股份有限公司 The recognition methods of malicious code family and device
CN111797394A (en) * 2020-06-24 2020-10-20 广州大学 APT organization identification method, system and storage medium based on stacking integration
CN112396188A (en) * 2020-11-19 2021-02-23 深延科技(北京)有限公司 Automatic machine learning and training method, device and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102034050A (en) * 2011-01-25 2011-04-27 四川大学 Dynamic malicious software detection method based on virtual machine and sensitive Native application programming interface (API) calling perception
CN106919841A (en) * 2017-03-10 2017-07-04 西京学院 A kind of efficient Android malware detection model DroidDet based on rotation forest
US20190034632A1 (en) * 2017-07-25 2019-01-31 Trend Micro Incorporated Method and system for static behavior-predictive malware detection
CN107872457A (en) * 2017-11-09 2018-04-03 北京明朝万达科技股份有限公司 A kind of method and system that network operation is carried out based on predicting network flow
CN110135159A (en) * 2019-04-18 2019-08-16 上海交通大学 The identification of malicious code shell and static hulling method and system
CN110414234A (en) * 2019-06-28 2019-11-05 奇安信科技集团股份有限公司 The recognition methods of malicious code family and device
CN111797394A (en) * 2020-06-24 2020-10-20 广州大学 APT organization identification method, system and storage medium based on stacking integration
CN112396188A (en) * 2020-11-19 2021-02-23 深延科技(北京)有限公司 Automatic machine learning and training method, device and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HANGFENG YANG 等: "A Novel Solutions for Malicious Code Detection and Family Clustering Based on Machine Learning", 《SECURITY AND PRIVACY IN EMERGING DECENTRALIZED COMMUNICATION ENVIRONMENTS》 *
ZIPEI LI: "Blackmailer or Consumer? A Character-level CNN Approach for Identifying Malicious Complaint Behaviors", 《2020 INTERNATIONAL CONFERENCE ON COMPUTING, NETWORKING AND COMMUNICATIONS (ICNC)》 *
LI YUXIONG: "Research on Intrusion Detection Method Based on Online Ensemble Learning", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113961922A (en) * 2021-10-27 2022-01-21 浙江网安信创电子技术有限公司 Malicious software behavior detection and classification system based on deep learning
CN114095268A (en) * 2021-11-26 2022-02-25 河北师范大学 Method, terminal and storage medium for network intrusion detection
CN114528946A (en) * 2021-12-16 2022-05-24 浙江省新型互联网交换中心有限责任公司 Autonomous domain system sibling relation recognition method
CN114528946B (en) * 2021-12-16 2022-10-04 浙江省新型互联网交换中心有限责任公司 Autonomous domain system sibling relationship identification method
CN116155630A (en) * 2023-04-21 2023-05-23 北京邮电大学 Malicious traffic identification method and related equipment
CN116155630B (en) * 2023-04-21 2023-07-04 北京邮电大学 Malicious traffic identification method and related equipment

Also Published As

Publication number Publication date
CN113221112B (en) 2022-03-04

Similar Documents

Publication Publication Date Title
CN113221112B (en) Malicious behavior identification method, system and medium based on weak correlation integration strategy
Vinayakumar et al. Evaluating deep learning approaches to characterize and classify the DGAs at scale
CN111027069B (en) Malicious software family detection method, storage medium and computing device
CN110266647B (en) Command and control communication detection method and system
CN110704840A (en) Convolutional neural network CNN-based malicious software detection method
CN112866023B (en) Network detection method, model training method, device, equipment and storage medium
CN112905421A (en) Container abnormal behavior detection method of LSTM network based on attention mechanism
CN112073551B (en) DGA domain name detection system based on character-level sliding window and depth residual error network
CN111382438A (en) Malicious software detection method based on multi-scale convolutional neural network
CN111400713B (en) Malicious software population classification method based on operation code adjacency graph characteristics
CN111709022B (en) Hybrid alarm association method based on AP clustering and causal relationship
CN114329474A (en) Malicious software detection method integrating machine learning and deep learning
CN112257076B (en) Vulnerability detection method based on random detection algorithm and information aggregation
CN111797997A (en) Network intrusion detection method, model construction method, device and electronic equipment
Jie Research on malicious TLS traffic identification based on hybrid neural network
Ding et al. Detecting Domain Generation Algorithms with Bi-LSTM.
CN115842645A (en) UMAP-RF-based network attack traffic detection method and device and readable storage medium
Lu et al. Stealthy malware detection based on deep neural network
CN114139153A (en) Graph representation learning-based malware interpretability classification method
Rathod et al. Model comparison and multiclass implementation analysis on the unsw nb15 dataset
Karthik et al. Detecting Internet of Things Attacks Using Post Pruning Decision Tree-Synthetic Minority Over Sampling Technique.
CN112149121A (en) Malicious file identification method, device, equipment and storage medium
Avram et al. Tiny network intrusion detection system with high performance
CN113595987B (en) Communication abnormal discovery method and device based on baseline behavior characterization, storage medium and electronic device
Lin et al. Behaviour classification of cyber attacks using convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant