CN113704756A - Method, system and medium for detecting robustness of mining type malicious code based on integration strategy

Info

Publication number
CN113704756A
Authority
CN
China
Prior art keywords
strategy
adopting
model
training
robustness
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110812889.3A
Other languages
Chinese (zh)
Inventor
李树栋
厉源
吴晓波
韩伟红
方滨兴
田志宏
殷丽华
顾钊铨
仇晶
唐可可
李默涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University
Priority to CN202110812889.3A
Publication of CN113704756A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, a system and a medium for robust detection of mining-type malicious code based on an integration (ensemble) strategy. The method comprises: reading the character strings of a binary file by a static analysis method; performing feature vectorization with the TF-IDF algorithm to obtain training samples; randomly extracting training samples with a Bagging strategy; training a model on the extracted training samples with a Boosting strategy and predicting with it; and taking the average of the results predicted over multiple rounds of training. The invention improves on existing static analysis methods by attending only to the string distribution characteristics of mining-type malicious code, so the feature engineering is simple and fast. In addition, the detection model is designed with an integration strategy that combines the ideas of the Bagging and Boosting algorithms; the model is simple to construct, predicts quickly and accurately, and is more robust, giving more accurate and stable prediction results.

Description

Method, system and medium for detecting robustness of mining type malicious code based on integration strategy
Technical Field
The invention belongs to the technical field of network security, and particularly relates to a method, a system and a medium for robust detection of mining-type malicious code based on an integration strategy.
Background
Malicious code is harmful computer code or web script designed to create system vulnerabilities, thereby opening backdoors, creating security hazards, stealing information and data, and causing other potential damage to files and computer systems. Common malicious code includes computer viruses, computer worms, Trojan horses, and the like. A Trojan horse (Trojan for short) is a program that appears to function normally but in fact hides functionality the user does not want. Mining-type malicious code (also known as a mining Trojan or mining virus) is a specific type of Trojan that profits mainly by invading a computer system and planting a miner to earn cryptocurrency.
In recent years, the amount of malicious code on networks has continued to increase, and one important reason is that mining-type malicious code accounts for a growing share of it. In 2020, Kaspersky's detection systems detected an average of 360,000 new malicious files per day, an increase of 5.2% over the previous year. The main reason for this increase was a large rise in the number of Trojans and backdoor programs, which grew by 40.5% and 23%, respectively. As for mining viruses specifically, 1,523,148 users were attacked by them in 2020, accounting for 2.49% of all attacks and 13.82% of attacks by high-risk programs. The same holds in China: in 2020 the Rising cloud security system intercepted a total of 148 million virus samples and recorded 352 million virus infections, and the total number of viruses increased by 43.71% over the same period in 2019. During the reporting period, 77.28 million new Trojan viruses accounted for 52.05% of the total, making Trojans the largest virus class. Notably, mining viruses still held an important position in 2020: the Rising cloud security system captured 9.22 million mining virus samples in 2020, with 5.78 million infections, and the total number of such viruses was 332.32% higher than in 2019.
The rise of mining-type malicious code can be attributed to three factors. The first is the appreciation and anonymity of virtual currency. In recent years the prices of various cryptocurrencies have soared; according to statistics from the CoinMarketCap website, in 2020 the price of Bitcoin at one point exceeded 50,000 US dollars per BTC and its market value reached 920 billion US dollars, 10 times that of 2019 and a historical high. In particular Monero, whose price rose 6-fold over the same period, is regarded as the only currency that currently cannot be traced, so cyber criminals can hide their identity while profiting from mining, and it is increasingly sought after by them. The second is the ease and low cost of mining technology. Most mining activity today uses open source programs and requires only registering a wallet address, so cyber criminals can profit with almost no effort. The third is that system-level vulnerabilities emerge endlessly and are easily exploited by mining malware. Remote code execution (RCE) vulnerabilities let a remote attacker inject operating system commands or malicious code directly into a back-end server and thereby control the back-end system, and they are the intrusion channel most frequently used by mining Trojans. For example, the mining Trojan group z0Miner attacked through the WebLogic unauthorized command execution vulnerabilities (CVE-2020-14882/14883), launching its attack only 15 days after the official WebLogic security bulletin was released (2020.10.21). Similarly, the RunMiner mining Trojan used an Apache Shiro deserialization vulnerability (CVE-2016-.
Mining-type malicious code typically exhibits three characteristics in a computer system. The first is the diversity of its propagation modes. Mining-type malicious code can spread through numerous channels such as vulnerabilities, brute-forcing of weak passwords, software bundling and social engineering. Besides the mining module, it can also carry a worm module capable of lateral movement, spreading freely through vulnerabilities in enterprise terminals to infect more of them. The second is the concealment of its harmful behavior. While running, mining-type malicious code occupies a large amount of system resources, so the resulting system slowdown is easily noticed by the user; it therefore protects itself with techniques such as disguising itself as system files and fileless persistence, so that even once discovered it cannot easily be removed, and it continues to occupy the user's system resources for a long time and to profit from mining. The third is continuous updating. Most common mining-type malicious code is updated over the long term, for example by changing the currency being mined at any time, by adding propagation modes, or by adding obfuscation methods to protect itself.
At present, mining-type malicious code works mainly through two mechanisms: web page mining scripts and executable mining files. With a web page mining script, once the script has been implanted in a web page the user visits, the browser parses and executes it, so that the browser occupies a large amount of computer resources for mining. Executable mining files can be divided by attack type into passive and active attacks. In a passive attack, mining starts only after the user runs the executable Trojan program. Many executable mining files lure users with some benefit: they are usually disguised as applications the user "urgently needs", such as game cheats and activation tools, or as applications that let the user gain some direct or indirect benefit, such as idle-automation software and Internet-cafe VIP video players. The typical active attack is the mining Trojan botnet, which a hacker generally builds by invading a computer, implanting the mining Trojan, and then using the invaded computer to implant the mining Trojan in other computers. Starting from these two working mechanisms, a series of detection methods for mining-type malicious code has emerged, which can be divided into two detection types: traditional detection and machine learning. Traditional detection is based on feature signatures: it matches features of the mining malicious code under test against known mining malicious features, so detection is fast and the false alarm rate is low. Detection based on machine learning can in turn be divided into two methods, static analysis and dynamic analysis.
The static analysis method directly reads the content of the program file and extracts effective feature information without executing the program. Because the file is never executed, detection is fast and no malicious behavior that could harm the system is triggered. The dynamic analysis method actually runs the program file in a virtual environment, so the behavior of the malicious code can be analyzed in depth and the accuracy is high.
Each of these existing detection methods for mining-type malicious code has shortcomings. Traditional detection relies on a large amount of expert experience and on a continuously updated library of mining malicious features, so it can detect only known malicious code. Among the machine-learning-based methods, existing static analysis requires the analyst to have a high level of expertise in order to extract effective features, and feature extraction becomes extremely difficult when files are packed or their resources are obfuscated. Dynamic analysis must monitor the behavior of the program file in real time, which consumes a great deal of computing resources and inevitably incurs a large time overhead.
Disclosure of Invention
The main purpose of the invention is to overcome the defects and shortcomings of the prior art and to provide a method, a system and a medium for robust detection of mining-type malicious code based on an integration strategy.
In order to achieve the purpose, the invention adopts the following technical scheme:
One aspect of the invention provides a method for robust detection of mining-type malicious code based on an integration strategy, which comprises the following steps:
S1, reading character strings in the binary file samples by a static analysis method;
S2, performing feature vectorization on the read character strings with the TF-IDF algorithm, and obtaining string feature matrices of the mining-type and non-mining-type malicious codes, respectively, as training samples;
S3, randomly extracting training samples with a Bagging strategy;
S4, training a model on the extracted training samples with a Boosting strategy, and using the model to predict;
S5, taking the two steps of randomly extracting training samples and of training a model and predicting as one judgment cycle, and taking the average of the results of multiple judgment cycles as the model prediction result.
As a preferred technical solution, reading the character strings in the binary file samples by the static analysis method specifically comprises: reading the binary file samples in the data set as binary bytecode and then decoding them into character strings.
As a preferred technical solution, performing feature vectorization on the read character strings with the TF-IDF algorithm specifically comprises:
generating n-gram terms from the original character strings in the samples;
counting the term frequency TF of each character string;
counting the inverse document frequency IDF of each character string;
calculating the importance of each character string from the TF and the IDF, and finally obtaining the string feature matrices of the mining-type and non-mining-type malicious codes, respectively.
As a preferred technical solution, the term frequency TF of each character string is calculated as follows:
$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
where $\mathrm{TF}_{i,j}$ represents the frequency of occurrence of string $i$ in sample $j$; $n_{i,j}$ represents the number of times string $i$ appears in sample $j$; and $\sum_k n_{k,j}$ represents the total number of strings in sample $j$.
As a preferred technical solution, the inverse document frequency IDF of each character string is calculated as follows:
$$\mathrm{IDF}_{i,j} = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $\mathrm{IDF}_{i,j}$ represents the inverse document frequency of string $i$ in sample $j$; $|D|$ represents the total number of samples; and $|\{j : t_i \in d_j\}|$ represents the number of samples $d_j$ containing string $t_i$.
As a preferred technical solution, the importance of each character string is calculated from TF and IDF as follows:
$$\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_{i,j}$$
where $\mathrm{TF\text{-}IDF}_{i,j}$ indicates the importance of each character string; $\mathrm{TF}_{i,j}$ represents the frequency of occurrence of string $i$ in sample $j$; and $\mathrm{IDF}_{i,j}$ represents the inverse document frequency of string $i$ in sample $j$.
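As a purely illustrative calculation (the counts are invented for the example, and the natural logarithm and the unsmoothed ratio form of the IDF are assumptions, since the text does not fix either), consider a string that appears 3 times in a sample containing 100 strings, and that occurs in 10 of 1,000 samples:

$$\mathrm{TF} = \frac{3}{100} = 0.03, \qquad \mathrm{IDF} = \ln\frac{1000}{10} = \ln 100 \approx 4.605, \qquad \mathrm{TF\text{-}IDF} \approx 0.03 \times 4.605 \approx 0.138$$

A string that occurs in every sample would instead have an IDF of $\ln 1 = 0$ and therefore contribute nothing, however often it appears.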
As a preferred technical solution, randomly extracting training samples with the Bagging strategy specifically comprises:
obtaining, from the existing malicious code data set, a plurality of sampling sets each containing a plurality of training samples by repeated sampling with replacement.
As a preferred technical solution, training the model on the extracted training samples with the Boosting strategy specifically comprises:
training with the Boosting strategy on each sampling set separately; the Boosting strategy is specifically the XGBoost algorithm; the XGBoost algorithm takes CART decision trees as sub-models and combines a plurality of CART decision trees by Gradient Tree Boosting to obtain the final model;
the tree generation process of the CART decision tree is specifically: whether a node is split is decided by calculating the information gain before and after the split, the depth of the tree is controlled by parameters, and the tree is pruned after generation to prevent overfitting;
Gradient Tree Boosting is specifically: the CART tree generated in the latest round of training learns the deviation between the true value and the predictions of the models from the previous rounds, so that the prediction of the current model gradually approaches the true value.
In another aspect, the invention further provides a system for robust detection of mining-type malicious code based on an integration strategy, which applies the above method and comprises a preprocessing module, a feature vectorization module, a sample extraction module, a model training module and a prediction module;
the preprocessing module is used for reading character strings in the binary file samples by a static analysis method;
the feature vectorization module is used for performing feature vectorization on the read character strings with the TF-IDF algorithm to obtain string feature matrices of the mining-type and non-mining-type malicious codes, respectively, as training samples;
the sample extraction module is used for randomly extracting training samples with a Bagging strategy;
the model training module is used for training a model on the extracted training samples with a Boosting strategy and for predicting with the model;
the prediction module, relying on the sample extraction module and the model training module, treats the two steps of randomly extracting training samples and of training a model and predicting as one judgment cycle, and takes the average of the results of multiple judgment cycles as the model prediction result.
In another aspect, the invention further provides a storage medium storing a program which, when executed by a processor, implements the above method for robust detection of mining-type malicious code based on an integration strategy.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) the method extracts features at the string level of the sample file and is applicable to exe files, elf files, dll files and the various other forms in which malicious code, including mining Trojans, exists; it attends only to the string distribution characteristics of mining-type malicious code, so the feature engineering is simple and fast, whereas the prior art mostly depends on expert knowledge such as the file structure and working mechanism of a specific malicious code;
(2) existing static analysis techniques often perform poorly when files are packed or obfuscated: because part of a string sequence (such as the opcodes) is likely to change, the corresponding static features lose their original meaning and can no longer support the malicious code classification task; such transformations, however, have little influence on the distribution of the strings. Because the two categories, malicious code and benign code, exist objectively, there are always differences between them, and these differences have certain characteristics; once enough features reflecting the differences are available, the two categories can be distinguished. From this point of view, the method of the invention is an improvement over existing static analysis methods;
(3) the invention designs the detection model with an integration strategy; the model is simple to construct and predicts quickly and accurately. Building the detection model by fusing the ideas of the Bagging and Boosting algorithms improves the robustness of the model and makes the prediction results more accurate and stable;
(4) the method for robust detection of mining-type malicious code based on an integration strategy can be applied in a real network environment, giving the mining-type malicious code detection task higher accuracy, robustness and efficiency;
(5) in the prior art, model robustness is often improved by enlarging the data scale, whereas the invention focuses only on tuning at the machine learning model level and is therefore suitable for situations with limited data scale in network security problems, including malware detection.
Drawings
FIG. 1 is an overall flowchart of a method for robust detection of mining-type malicious code based on an integration strategy according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an original string format of a sample file according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a detection model architecture according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a system for robust detection of mining-type malicious code based on an integration strategy according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a storage medium according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Examples
Static analysis: the PE file is treated as a binary file, read as binary bytecode and then decoded into character strings, and data exploration and feature extraction are carried out at the string level. A string is a sequence of printable characters in a program's binary file, and malware analysts often rely on the strings in a malicious sample to quickly understand what it may do. The strings of a binary file usually contain key information, such as the HTTP and FTP commands used to download web pages and files, IP addresses and host names that reveal the addresses being connected to, text explaining the purpose of the binary file, the compiler used to create the binary file, the programming language in which it was written, and embedded scripts or HTML. The original string format of the sample file is shown in FIG. 2.
As shown in FIG. 1, the present embodiment provides a method for robust detection of mining-type malicious code based on an integration strategy, which includes the following steps:
First, feature engineering
S1, data preprocessing: reading the character strings in the binary file samples of the data set by a static analysis method, specifically:
reading the binary file samples in the data set as binary bytecode and then decoding them into character strings.
Further, the data set used in this embodiment consists of a large number of PE programs, both mining-type and non-mining-type malicious code, captured from the Internet every day, in which the MZ header, PE header, import/export tables and other areas of the PE structure have been erased. This means that dynamic analysis cannot be performed on the samples, but the code instruction features of the mining function are still present, so a static analysis method is adopted in this embodiment.
Furthermore, the static analysis method treats the PE file as a binary file, reads it as binary bytecode, decodes it into character strings, and performs data exploration and feature extraction at the string level. A string is a sequence of printable characters in a program's binary file, and the strings in a malicious sample make it possible to understand quickly what the sample may do. The strings of a binary file usually contain key information, such as the HTTP and FTP commands used to download web pages and files, IP addresses and host names that reveal the addresses being connected to, text explaining the purpose of the binary file, the compiler used to create the binary file, the programming language in which it was written, and embedded scripts or HTML. The original string format of the sample file is shown in FIG. 2.
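The string-reading step can be sketched as follows. This is only a minimal illustration, not the embodiment's actual code: the function names, the restriction to printable ASCII, and the minimum run length of 4 characters are all assumptions, since the text does not specify how printable sequences are delimited.

```python
import re

def extract_strings(data, min_len=4):
    """Return runs of printable ASCII characters that are at least min_len long.

    The ASCII-only range and the min_len default are illustrative assumptions.
    """
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]

def strings_from_file(path, min_len=4):
    """Read a sample as raw bytes (binary mode) and decode its printable runs."""
    with open(path, "rb") as handle:
        return extract_strings(handle.read(), min_len)

# A made-up byte sequence standing in for a sample file.
sample = b"\x00\x01stratum+tcp://pool.example:3333\x00\xff\x02abc\x00GET /index.html\x90"
found = extract_strings(sample)
print(found)
```

Because the file is only read and never executed, this works even on samples whose PE headers have been erased.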
Existing static analysis techniques often perform poorly when files are packed or obfuscated: because part of a string sequence (such as the opcodes) is likely to change, the corresponding static features lose their original meaning and can no longer support the malicious code classification task. Such transformations, however, have little influence on the distribution of the strings. Because the two categories, malicious code and benign code, exist objectively, there are always differences between them, and these differences have certain characteristics; once enough features reflecting the differences are available, the two categories can be distinguished. In this respect, the method of the present invention is an improvement over existing static analysis methods.
S2, feature vectorization: the importance of a character string in a sample is proportional to the number of times it occurs in that sample and inversely proportional to the number of times it occurs across all samples. Based on this idea, feature vectorization is performed on the read character strings with the TF-IDF algorithm, and string feature matrices of the mining-type and non-mining-type malicious codes are obtained, respectively, as training samples, specifically:
S2.1, generating n-gram terms from the original character strings in the samples;
S2.2, counting the term frequency TF of each character string, specifically:
$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
where $\mathrm{TF}_{i,j}$ represents the frequency of occurrence of string $i$ in sample $j$; $n_{i,j}$ represents the number of times string $i$ appears in sample $j$; and $\sum_k n_{k,j}$ represents the total number of strings in sample $j$.
S2.3, counting the inverse document frequency IDF of each character string, specifically:
$$\mathrm{IDF}_{i,j} = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $\mathrm{IDF}_{i,j}$ represents the inverse document frequency of string $i$ in sample $j$; $|D|$ represents the total number of samples; and $|\{j : t_i \in d_j\}|$ represents the number of samples $d_j$ containing string $t_i$.
S2.4, calculating the importance of each character string from the TF and the IDF, and finally obtaining the string feature matrices of the mining-type and non-mining-type malicious codes, respectively, as follows:
$$\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_{i,j}$$
where $\mathrm{TF\text{-}IDF}_{i,j}$ indicates the importance of each character string; $\mathrm{TF}_{i,j}$ represents the frequency of occurrence of string $i$ in sample $j$; and $\mathrm{IDF}_{i,j}$ represents the inverse document frequency of string $i$ in sample $j$.
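Steps S2.1 to S2.4 can be sketched in plain Python as below. This is a hedged illustration rather than the embodiment's implementation: the helper names are hypothetical, the n-grams are taken over characters with n = 3 (the text does not say whether n-grams are over characters or tokens, nor the value of n), and the IDF uses the natural logarithm of the unsmoothed ratio of total samples to samples containing the term.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """S2.1 (assumed form): split one extracted string into character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tf_idf(samples):
    """samples: one list of terms per sample. Returns one {term: weight} dict per sample."""
    n_samples = len(samples)
    doc_freq = Counter()
    for terms in samples:
        doc_freq.update(set(terms))          # number of samples containing each term
    weights = []
    for terms in samples:
        counts = Counter(terms)
        total = sum(counts.values())         # total number of terms in this sample
        weights.append({
            term: (count / total) * math.log(n_samples / doc_freq[term])  # TF x IDF
            for term, count in counts.items()
        })
    return weights

# Toy data: the terms are invented for illustration only.
toy = [["xmrig", "pool", "pool", "http"], ["http", "dll", "dll", "user"]]
w = tf_idf(toy)
print(w[0])
```

In the toy data, "http" occurs in both samples, so its weight is zero, while "pool" occurs twice in a single sample and receives the largest weight there. Library implementations may differ in detail (for example by smoothing the IDF), so their values need not match this sketch.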
Second, model construction
(1) Ensemble learning
Ensemble learning completes a learning task by building and combining multiple learners. Common ensemble learning algorithms include Bagging, Boosting and Stacking. The Bagging algorithm randomly draws samples from the training data set, trains a model on each set of drawn samples, and determines the predicted value by a vote among all the models. The Boosting algorithm builds a series of models and aggregates them into a strong learner with better performance; during fitting, each model in the series pays more attention to the observations that the previous models in the series predicted wrongly. In the Stacking algorithm, the first layer consists of several base learners whose input is the original training set, and the second-layer model is trained using the outputs of the first-layer base learners as its training set, yielding the complete Stacking model.
A classifier based on the integration strategy combines the advantages of several base classifiers and can therefore achieve higher accuracy, together with higher robustness, better generalization ability and stronger parallelism. Assuming that the errors of the base classifiers are independent of one another, that each base classifier has error rate $\epsilon$, and that $T$ is the number of classifiers combined by majority vote, the total error rate of the integrated classifier is bounded by the Hoeffding inequality:
$$P\big(H(x) \neq f(x)\big) = \sum_{k=0}^{\lfloor T/2 \rfloor} \binom{T}{k} (1-\epsilon)^k \epsilon^{T-k} \le \exp\left(-\frac{1}{2} T (1 - 2\epsilon)^2\right)$$
The right-hand side of the formula shows that, when each base classifier is better than random guessing ($\epsilon < 0.5$) and the number $T$ of integrated classifiers is sufficiently large, the total error rate tends to 0, indicating that a classifier based on the integration strategy can achieve arbitrarily high accuracy under these conditions.
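The behaviour of this bound can be checked numerically. The sketch below assumes the standard majority-vote setting: T base classifiers that err independently, each with the same error rate eps (a symbol introduced here for illustration), with the ensemble wrong whenever at most floor(T/2) of them are correct. It compares that exact probability with the bound exp(-T(1 - 2*eps)^2 / 2); the function names are hypothetical.

```python
import math

def majority_vote_error(T, eps):
    """Exact probability that at most floor(T/2) of T independent classifiers are correct."""
    return sum(
        math.comb(T, k) * (1 - eps) ** k * eps ** (T - k)
        for k in range(T // 2 + 1)
    )

def hoeffding_bound(T, eps):
    """Upper bound exp(-T * (1 - 2*eps)^2 / 2) on the majority-vote error."""
    return math.exp(-0.5 * T * (1 - 2 * eps) ** 2)

eps = 0.3  # illustrative base error rate, below 0.5
for T in (1, 5, 25, 101):
    print(T, majority_vote_error(T, eps), hoeffding_bound(T, eps))
```

Both columns shrink as T grows, but only because eps is below 0.5 and the errors are assumed independent; base classifiers trained on overlapping data are correlated in practice, so the bound is an idealisation.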
(2) Mining type malicious code detection model
Among ensemble learning algorithms, Boosting mainly focuses on reducing bias, while Bagging mainly focuses on reducing variance. The method aims to design a mining-type malicious code detection model that is both accurate and robust, so the bias-variance balance of the model prediction needs to be achieved as far as possible, as shown in fig. 3:
S3, randomly extracting training samples by adopting a Bagging strategy, specifically:
from the existing malicious code data sets (including mining types and non-mining types), a plurality of sampling sets, each containing a plurality of training samples, are obtained by repeated sampling with replacement.
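Sampling with replacement in step S3 can be illustrated as follows. This is a minimal sketch using only the Python standard library; the helper name `bootstrap_sets`, the toy data set and the choice of making each sampling set as large as the original data set are assumptions, since the text does not fix the size of a sampling set.

```python
import random

def bootstrap_sets(dataset, n_sets, set_size=None, seed=0):
    """Draw n_sets sampling sets from dataset by sampling with replacement."""
    rng = random.Random(seed)
    size = len(dataset) if set_size is None else set_size
    return [rng.choices(dataset, k=size) for _ in range(n_sets)]

# Hypothetical labelled samples: (sample_id, label), 1 = mining, 0 = non-mining.
dataset = [(i, i % 2) for i in range(10)]
sampling_sets = bootstrap_sets(dataset, n_sets=3)
```

Because each draw is made with replacement, a sampling set usually contains some samples more than once and omits others, which is what makes the sub-models trained on different sets differ from one another.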
S4, training a model by adopting a Boosting strategy according to the extracted training samples, and predicting by using the model;
based on each sampling set, training is carried out separately by adopting the Boosting strategy to obtain a plurality of detection sub-models for subsequent sample prediction.
The Boosting strategy specifically adopts the XGBoost algorithm;
the XGBoost algorithm takes a CART decision tree as a sub-model, and realizes the ensemble learning of a plurality of CART decision trees by Gradient Tree Boosting to obtain the final model;
during training with the Boosting strategy, the CART algorithm governs the tree generation process: whether a node is split is determined by calculating the information gain before and after the split, the depth of the tree is controlled by parameters, and the tree is pruned after generation to prevent overfitting;
the idea of Gradient Tree Boosting is that the CART tree generated in the latest round of training learns the deviation between the true value and the predicted values of the models of the previous rounds, so that the prediction of the current model gradually approaches the true value.
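The residual-fitting idea behind Gradient Tree Boosting can be shown with a deliberately simplified sketch. It is not XGBoost: it uses squared-error loss, depth-1 regression trees (stumps) instead of full CART trees, and no regularisation or pruning. All names (`fit_stump`, `boost`, `predict`), the learning rate and the toy data are assumptions made for illustration.

```python
def fit_stump(X, residuals):
    """Find the single-split regression tree that minimises squared error.
    Returns (feature, threshold, left_value, right_value)."""
    n = len(X)
    mean = sum(residuals) / n
    # Start from the "no split" tree, which predicts the mean everywhere.
    best = (sum((r - mean) ** 2 for r in residuals), None, None, mean, mean)
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[f] <= thr]
            right = [r for row, r in zip(X, residuals) if row[f] > thr]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if err < best[0]:
                best = (err, f, thr, lm, rm)
    return best[1:]

def stump_predict(stump, row):
    f, thr, lm, rm = stump
    if f is None:
        return lm
    return lm if row[f] <= thr else rm

def boost(X, y, rounds=20, lr=0.5):
    """Each round fits a stump to the residual between the true values
    and the prediction accumulated over the previous rounds."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [t - p for t, p in zip(y, pred)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [p + lr * stump_predict(stump, row) for p, row in zip(pred, X)]
    return base, lr, stumps

def predict(model, row):
    base, lr, stumps = model
    return base + lr * sum(stump_predict(s, row) for s in stumps)

# Toy feature rows and labels (1 = mining, 0 = non-mining), for illustration only.
X = [[0.0, 1.0], [0.1, 0.9], [0.9, 0.2], [1.0, 0.1]]
y = [0, 0, 1, 1]
model = boost(X, y)
```

With each round the remaining residual shrinks, so the accumulated prediction moves toward the true label, which is the behaviour the paragraph above describes.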
S5, taking steps S3 and S4 as a judgment cycle, and taking the average value of the results of the judgment cycles as the model prediction result.
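The judgment cycle of steps S3 to S5 can be sketched as a loop that draws a sampling set, trains a sub-model, predicts, and finally averages the per-cycle results. To keep the example self-contained, the Boosting model is replaced by a trivial stand-in learner (`train_threshold`); that learner, the function names, the number of cycles and the toy data are all assumptions.

```python
import random
from statistics import mean

def train_threshold(sampling_set):
    """Stand-in learner (an assumption, not the patent's Boosting model):
    puts a threshold midway between the class means of a 1-D feature."""
    pos = [x for x, label in sampling_set if label == 1]
    neg = [x for x, label in sampling_set if label == 0]
    if not pos or not neg:            # degenerate draw with a single class
        return None
    return (mean(pos) + mean(neg)) / 2

def predict_threshold(model, x):
    return 1.0 if x > model else 0.0

def ensemble_predict(dataset, x, n_cycles=25, seed=0):
    """One judgment cycle = draw with replacement (S3) + train and predict (S4);
    the final score is the mean over all cycles (S5)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_cycles):
        sampling_set = rng.choices(dataset, k=len(dataset))
        model = train_threshold(sampling_set)
        if model is None:             # skip cycles that cannot train a model
            continue
        scores.append(predict_threshold(model, x))
    return mean(scores)

# Toy (feature, label) pairs: 1 = mining, 0 = non-mining.
dataset = [(0.1, 0), (0.2, 0), (0.3, 0), (0.15, 0), (0.25, 0),
           (0.8, 1), (0.9, 1), (0.7, 1), (0.85, 1), (0.75, 1)]
score_high = ensemble_predict(dataset, 0.95)
score_low = ensemble_predict(dataset, 0.05)
```

Averaging over cycles gives a score between 0 and 1 rather than a hard label, so a decision threshold still has to be chosen; the text does not specify one.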
As shown in fig. 4, another embodiment of the present application provides a system for detecting the robustness of mining-type malicious code based on an integration strategy, which includes a preprocessing module, a feature vectorization module, a sample extraction module, a model training module, and a prediction module;
the preprocessing module is used for reading character strings in the binary file sample by adopting a static analysis method;
the feature vectorization module is used for carrying out feature vectorization on the read character strings by adopting a TF-IDF algorithm to respectively obtain character string feature matrixes of the mining type malicious codes and the non-mining type malicious codes as training samples;
the sample extraction module is used for randomly extracting training samples by adopting a Bagging strategy;
the model training module is used for training a model by adopting a Boosting strategy according to the extracted training samples and predicting by using the model;
the prediction module relies on the sample extraction module and the model training module to randomly extract training samples and to train a model and predict; these two steps form one judgment cycle, and the average value of the results of multiple judgment cycles is used as the model prediction result.
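As one illustration of the preprocessing module, the sketch below reads a sample as raw bytes and decodes runs of printable characters into strings, similar in spirit to the Unix `strings` utility. The regular expression, the minimum length of 4 and the function name `extract_strings` are assumptions; the text only states that the binary file is read as bytes and then decoded into character strings.

```python
import re

def extract_strings(data: bytes, min_len: int = 4):
    """Return runs of at least min_len printable ASCII bytes, decoded to str."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]

# Hypothetical byte content standing in for a binary file sample.
blob = b"\x00\x01stratum+tcp://pool\x00\xffab\x00xmrig\x90"
strings = extract_strings(blob)
```

For a file on disk, the bytes would typically be obtained with `open(path, "rb").read()` before being passed to the function.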
It should be noted that the system provided in the above embodiment is only illustrated by the division of the above functional modules, and in practical applications, the above function allocation may be completed by different functional modules according to needs, that is, the internal structure is divided into different functional modules to complete all or part of the above described functions.
As shown in fig. 5, another embodiment of the present application further provides a storage medium storing a program which, when executed by a processor, implements the method for detecting the robustness of mining-type malicious code based on an integration strategy according to the above embodiment, specifically:
S1, reading character strings in the binary file sample by adopting a static analysis method;
S2, performing feature vectorization on the read character strings by adopting a TF-IDF algorithm, and respectively obtaining character string feature matrixes of the mining type malicious codes and the non-mining type malicious codes as training samples;
S3, randomly extracting training samples by adopting a Bagging strategy;
S4, training a model by adopting a Boosting strategy according to the extracted training samples, and predicting by using the model;
S5, taking the two steps of randomly extracting training samples and of training a model and predicting as one judgment cycle, and taking the average value of the results of multiple judgment cycles as the model prediction result.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A method for detecting the robustness of mining-type malicious code based on an integration strategy, characterized by comprising the following steps:
reading character strings in the binary file sample by adopting a static analysis method;
performing feature vectorization on the read character strings by adopting a TF-IDF algorithm to respectively obtain character string feature matrixes of the mining type malicious codes and the non-mining type malicious codes as training samples;
randomly extracting training samples by adopting a Bagging strategy;
training a model by adopting a Boosting strategy according to the extracted training samples, and predicting by using the model;
taking the two steps of randomly extracting training samples and of training a model and predicting as one judgment cycle, and taking the average value of the results of multiple judgment cycles as the model prediction result.
2. The method for detecting the robustness of mining-type malicious code based on the integration strategy according to claim 1, wherein the reading of the character strings in the binary file sample by adopting the static analysis method specifically comprises: reading the binary file samples in the data set as binary byte code, and then decoding them into character strings.
3. The method for detecting the robustness of mining-type malicious code based on the integration strategy according to claim 1, wherein the performing feature vectorization on the read character strings by adopting the TF-IDF algorithm specifically comprises:
generating n-gram entries from the original character strings in the samples;
counting the word frequency TF of each character string;
computing the inverse document frequency IDF of each character string;
and calculating the importance of each character string according to the TF and the IDF, and finally obtaining character string feature matrixes of the mining type malicious codes and the non-mining type malicious codes respectively.
4. The method for detecting the robustness of mining-type malicious code based on the integration strategy according to claim 3, wherein the word frequency TF of each character string is specifically given by the following formula:
$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
where $\mathrm{TF}_{i,j}$ denotes the frequency of occurrence of character string i in sample j; $n_{i,j}$ denotes the number of times character string i appears in sample j; and $\sum_{k} n_{k,j}$ denotes the total number of character strings in sample j.
5. The method for detecting the robustness of mining-type malicious code based on the integration strategy according to claim 1, wherein the inverse document frequency IDF of each character string is specifically given by:
$$\mathrm{IDF}_{i,j} = \log \frac{|D|}{|\{j : i \in d_j\}|}$$
where $\mathrm{IDF}_{i,j}$ denotes the inverse document frequency of character string i in sample j; $|D|$ denotes the total number of samples; and $|\{j : i \in d_j\}|$ denotes the number of samples containing character string i.
6. The method for detecting the robustness of mining-type malicious code based on the integration strategy according to claim 1, wherein the importance of each character string is calculated from TF and IDF specifically as follows:
$$\mathrm{TF\text{-}IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_{i,j}$$
where $\mathrm{TF\text{-}IDF}_{i,j}$ denotes the importance of each character string; $\mathrm{TF}_{i,j}$ denotes the frequency of occurrence of character string i in sample j; and $\mathrm{IDF}_{i,j}$ denotes the inverse document frequency of character string i in sample j.
7. The method for detecting the robustness of mining-type malicious code based on the integration strategy according to claim 1, wherein the randomly extracting training samples by adopting the Bagging strategy specifically comprises:
obtaining, from the existing malicious code data set, a plurality of sampling sets each comprising a plurality of training samples by repeated sampling with replacement.
8. The method for detecting the robustness of mining-type malicious code based on the integration strategy according to claim 1, wherein the training of a model by adopting the Boosting strategy according to the extracted training samples specifically comprises:
training separately by adopting the Boosting strategy based on each sampling set; the Boosting strategy is specifically the XGBoost algorithm; the XGBoost algorithm takes a CART decision tree as a sub-model, and realizes the ensemble learning of a plurality of CART decision trees by Gradient Tree Boosting to obtain the final model;
the tree generation process of the CART decision tree specifically comprises the following steps: determining whether a node is split by calculating the information gain before and after the split, controlling the depth of the tree by parameters, and pruning the tree after generation to prevent overfitting;
the Gradient Tree Boosting specifically comprises the following steps: the CART tree generated in the latest round of training learns the deviation between the true value and the predicted values of the models of the previous rounds, so that the prediction of the current model gradually approaches the true value.
9. A system for detecting the robustness of mining-type malicious code based on an integration strategy, characterized in that it is applied to the method for detecting the robustness of mining-type malicious code based on the integration strategy, and comprises a preprocessing module, a feature vectorization module, a sample extraction module, a model training module and a prediction module;
the preprocessing module is used for reading character strings in the binary file sample by adopting a static analysis method;
the feature vectorization module is used for carrying out feature vectorization on the read character strings by adopting a TF-IDF algorithm to respectively obtain character string feature matrixes of the mining type malicious codes and the non-mining type malicious codes as training samples;
the sample extraction module is used for randomly extracting training samples by adopting a Bagging strategy;
the model training module is used for training a model by adopting a Boosting strategy according to the extracted training samples and predicting by using the model;
the prediction module relies on the sample extraction module and the model training module to randomly extract training samples and to train a model and predict; these two steps form one judgment cycle, and the average value of the results of multiple judgment cycles is used as the model prediction result.
10. A storage medium storing a program, characterized in that: the program, when executed by a processor, implements the method for detecting the robustness of mining-type malicious code based on the integration strategy according to any one of claims 1-8.
CN202110812889.3A 2021-07-19 2021-07-19 Method, system and medium for detecting robustness of mining type malicious code based on integration strategy Pending CN113704756A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110812889.3A CN113704756A (en) 2021-07-19 2021-07-19 Method, system and medium for detecting robustness of mining type malicious code based on integration strategy

Publications (1)

Publication Number Publication Date
CN113704756A true CN113704756A (en) 2021-11-26

Family

ID=78648903

Country Status (1)

Country Link
CN (1) CN113704756A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination