CN109858247A - Malware classification method using a static three-feature model based on XGBoost - Google Patents
- Publication number
- CN109858247A (application CN201811597864.0A)
- Authority
- CN
- China
- Prior art keywords
- sample
- feature group
- xgboost
- feature
- stage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Abstract
This application relates to a malware classification method using a static three-feature model based on XGBoost. Existing static feature extraction techniques draw features mainly from three views: the byte view (hexadecimal bytecode), the assembly view (asm assembly code), and the PE view (PE structure information). Using only one of these views takes little time but may yield very low accuracy; using all three improves accuracy, but the time cost multiplies and the accuracy is still clearly insufficient. The invention introduces the machine learning algorithm XGBoost and applies it to integrate the feature sets extracted from the three views, obtaining higher accuracy.
Description
Technical field
The present invention relates to a feature classification method, in particular a fused-feature classification method.
Background technique
Existing classifiers based on the XGBoost model either use too few static features or fuse in dynamic features; the latter can damage the analysis system and is unsafe. In practice, a malware classification model is needed that is more efficient, safer, and more accurate, and that does not damage the analysis system environment.
Summary of the invention
In view of the defects of the prior art, the invention proposes a malware classification method using a static three-feature model based on XGBoost. Compared with methods that classify malware from a single static feature, the three-feature XGBoost model improves classification accuracy; compared with methods that combine static and dynamic features, it improves efficiency and safety.
The technical solution adopted is as follows:
A malware classification method using a static three-feature model based on XGBoost, with the following working steps:
Step 1, stage S01, obtaining the data set: A raw data set is first obtained from the VirusShare website; the selected data set, "VirusShare_00271", contains 65,536 samples in total. Because the invention targets PE files on the Windows platform, the raw data must be screened. Non-PE files are screened out using the Exeinfo PE tool and the command-line file command, and samples whose family classification is unclear are then removed, finally yielding 2,798 samples from 182 different families. The method then enters stage S02.
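The PE screening in step 1 relies on Exeinfo PE and the file command. As a rough stand-in, and not the tooling named above, a minimal Python sketch can check the two signatures that a PE file carries: the "MZ" DOS magic and the "PE\0\0" signature at the offset stored at byte 0x3C. The function name is_pe and the synthetic sample bytes are illustrative assumptions.

```python
import struct

def is_pe(data: bytes) -> bool:
    """Return True if the byte string looks like a PE file (hypothetical helper)."""
    # A DOS header is at least 64 bytes long and starts with the magic "MZ".
    if len(data) < 64 or data[:2] != b"MZ":
        return False
    # e_lfanew: little-endian 32-bit offset of the PE header, stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    # The PE header begins with the 4-byte signature "PE\0\0".
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"

# Synthetic example: a 64-byte DOS header pointing at a PE signature at offset 64.
fake_pe = b"MZ" + b"\x00" * 58 + struct.pack("<I", 64) + b"PE\x00\x00"
not_pe = b"\x7fELF" + b"\x00" * 64
```

A check of this kind only confirms the container format; removing samples with an unclear family label still needs a separate labeling source.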
Step 2, stage S02, extracting the feature vectors of the three feature groups: The byte-view, assembly-view, and PE-view features of each sample are obtained separately. The method then enters stage S03.
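The text does not state which concrete features are taken from each view. Purely as an illustration of what a byte-view feature could look like, the sketch below computes a normalized 256-bin byte histogram. The choice of feature and the name byte_histogram are assumptions, not the inventors' method.

```python
from collections import Counter

def byte_histogram(data: bytes) -> list[float]:
    """Relative frequency of each byte value 0..255 (an assumed byte-view feature)."""
    counts = Counter(data)      # iterating over bytes yields ints in 0..255
    total = len(data) or 1      # avoid division by zero for empty input
    return [counts.get(b, 0) / total for b in range(256)]

# Four bytes: 0x4D 0x5A 0x90 0x00
features = byte_histogram(bytes.fromhex("4d5a9000"))
```

Each sample then contributes one fixed-length row to the byte-view feature matrix, which is the shape the later merging step expects.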
Step 3, stage S03, enumerating feature-group combinations and merging sample feature matrices: The sample feature matrices corresponding to the selected feature groups are merged into a single feature matrix. The three feature groups (byte view, assembly view, PE view) are numbered 0 to 2, so the selection or non-selection of the three groups in each combination can be represented by a feature-group selection sequence {I0, ..., Ii, ..., I2}, Ii ∈ {0, 1}, where Ii = 1 means feature group i is selected and Ii = 0 means it is not. For each combination that contains more than one feature group, the sample feature matrices of the different feature groups must be merged; the sample feature matrix of one feature group is merged into the total sample feature matrix at a time. The method then enters stage S04.
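The selection sequences of step 3 can be enumerated mechanically. The following sketch lists every sequence (I0, I1, I2) with at least one group selected; the names GROUPS and enumerate_combinations are illustrative assumptions.

```python
from itertools import product

GROUPS = ["byte", "asm", "pe"]  # feature groups numbered 0, 1, 2

def enumerate_combinations(n_groups: int = 3) -> list[tuple[int, ...]]:
    """All selection sequences (I0, ..., I_{n-1}) with at least one group selected."""
    return [seq for seq in product((0, 1), repeat=n_groups) if any(seq)]

combos = enumerate_combinations()
# Translate each 0/1 sequence into the names of the selected groups.
selected = [[GROUPS[i] for i, flag in enumerate(seq) if flag] for seq in combos]
```

With three groups this gives 2^3 - 1 = 7 combinations, one per XGBoost model trained in the next stage.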
Step 4, stage S04, training the XGBoost classification models: The training objective of each XGBoost classification model is to learn multiple regression trees such that the objective function

$$\mathrm{Obj} = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{t} \Omega(f_t)$$

is minimized, where $N$ is the number of software samples in the training set, $y_i$ is the predicted-family class label of training sample $i$ (0 indicates an incorrect prediction, 1 a correct one), $\hat{y}_i$ is the model's predicted value for software sample $i$, $l$ is the loss function, and $\Omega(f_t)$ is the complexity of the $t$-th regression tree. The regression trees are trained iteratively: each time a regression tree is trained, the total predicted value of the current overall classification model for each software sample is updated. A regression tree is generated by level-wise splitting, which keeps extending the depth of the tree. Once the tree has been generated, it is pruned to reduce its complexity. After pruning, the current predicted value of each leaf node for the software samples falling into it is computed. The method then enters stage S05.
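For reference, the iterative update, the tree complexity, and the leaf values mentioned in step 4 take the following form in the standard published XGBoost formulation. These expressions come from that general formulation and are an assumption about what the patent intends; the symbols $T$, $w_j$, $\gamma$, $\lambda$, $G_j$, and $H_j$ do not appear in the patent text.

```latex
% Additive training: tree f_t is added to the model built so far
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)

% Complexity of a tree with T leaves and leaf weights w_j
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2}

% Optimal value of leaf j, where G_j and H_j are the sums of the first and
% second derivatives of the loss over the samples falling into leaf j
w_j^{*} = -\frac{G_j}{H_j + \lambda}
```

Under this reading, the "current predicted value" of a leaf is $w_j^{*}$, and pruning removes splits whose gain does not outweigh the per-leaf penalty $\gamma$.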
Step 5, stage S05, generating the final classification result: The test-set software samples are input into the overall classification model trained by the XGBoost-based multi-feature-group model fusion algorithm. For each software sample, the same method as in the training stage is first used to obtain the sample feature vectors corresponding to the 7 feature-group combinations (with the three feature groups used here, excluding the case in which no group is selected, there are 2^3 - 1 = 7 different combinations). The outputs of the 7 XGBoost classification models are assembled into a new feature vector, which is input into a logistic regression classifier to obtain the probability pi that the software sample belongs to the malware. When pi > 0.5, the software sample is judged to be of the currently predicted family type.
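The fusion in step 5 is a form of stacking: the seven model outputs become the input of a logistic regression. A minimal sketch follows, assuming each XGBoost model emits a single score per sample; the weights, bias, and scores are placeholder values, not trained parameters or reported results.

```python
import math

def fuse(model_outputs, weights, bias):
    """Logistic regression over the outputs of the per-combination models."""
    if len(model_outputs) != len(weights):
        raise ValueError("one weight per model output is required")
    z = bias + sum(w * x for w, x in zip(weights, model_outputs))
    return 1.0 / (1.0 + math.exp(-z))   # the sigmoid gives the probability p_i

# Hypothetical outputs of the 7 models for one sample.
outputs = [0.9, 0.8, 0.7, 0.95, 0.85, 0.6, 0.9]
weights = [1.0] * 7      # placeholder coefficients; real ones come from training
bias = -3.5              # placeholder intercept
p = fuse(outputs, weights, bias)
is_predicted_family = p > 0.5   # decision rule from step 5
```

In practice the coefficients would be fitted on held-out predictions so that the fusion layer does not simply memorize the training set.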
In step 2, the screening of dynamic features is omitted, which ensures the safety of the analysis system.
In step 3, two feature matrices are merged as follows:
a. Record the feature-vector size of feature matrix 1 as offset. For a sparse feature matrix, this size is the largest feature index in feature matrix 1 plus 1.
b. Enumerate the feature vectors of each sample in the two feature matrices to be merged.
c. Copy the feature vector from matrix 1 directly into the merged feature vector.
d. Add offset to the feature index of every feature in the feature vector from matrix 2, and merge the result into the merged feature vector.
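Steps a to d can be sketched for sparse matrices as follows, assuming each sample is stored as a dictionary that maps feature index to value. That storage format and the name merge_sparse are assumptions made for illustration.

```python
def merge_sparse(matrix1, matrix2):
    """Merge two sparse feature matrices sample by sample.

    Each matrix is a list of samples; each sample is a dict {feature_index: value}.
    """
    if len(matrix1) != len(matrix2):
        raise ValueError("both matrices must describe the same samples")
    # (a) offset = largest feature index in matrix 1 plus 1
    offset = max((idx for row in matrix1 for idx in row), default=-1) + 1
    merged = []
    # (b) enumerate the feature vectors of each sample in both matrices
    for row1, row2 in zip(matrix1, matrix2):
        new_row = dict(row1)                 # (c) copy matrix 1 unchanged
        for idx, value in row2.items():      # (d) shift matrix 2 indices by offset
            new_row[idx + offset] = value
        merged.append(new_row)
    return merged, offset

m1 = [{0: 1.0, 3: 2.0}, {1: 0.5}]
m2 = [{0: 7.0}, {2: 9.0}]
merged, offset = merge_sparse(m1, m2)
```

Merging three feature groups is then two successive calls, which matches the rule of adding one feature group's matrix to the total at a time.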
In step 4, node splitting selects the optimal attribute, as in decision-tree construction; when the splitting process is a complete split, no pruning is performed.
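How the "optimal attribute" is scored is not defined here. In the standard XGBoost algorithm a candidate split is scored by its gain, computed from the sums of first derivatives (g) and second derivatives (h) of the loss on each side of the split. The sketch below uses that published formula as an assumption; the candidate names and numbers are made up for illustration.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of a candidate split from gradient (g) and Hessian (h) sums."""
    def score(g, h):
        return g * g / (h + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

def best_split(candidates, lam=1.0, gamma=0.0):
    """Pick the candidate (name, gL, hL, gR, hR) with the highest gain."""
    return max(candidates, key=lambda c: split_gain(*c[1:], lam=lam, gamma=gamma))

candidates = [
    ("feature_a", -4.0, 2.0, 4.0, 2.0),   # separates the gradients well
    ("feature_b", 0.0, 2.0, 0.0, 2.0),    # no separation at all
]
chosen = best_split(candidates)
```

A split whose gain is not positive adds no value, which is the usual criterion for leaving a node unsplit or pruning it afterwards.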
Beneficial effects:
Compared with the prior art, the advantage of the invention is that it ignores dynamic features and uses only static features, which increases the safety of the analysis system and improves efficiency. The XGBoost model used achieves better classification precision than traditional classification methods, improving the accuracy of malware classification.
Description of the drawings
Fig. 1 is the flow diagram of the invention.
Specific embodiments
With reference to Fig. 1, the method is carried out through stages S01 to S05 as set out in the Summary of the invention above.
Claims (4)
1. A malware classification method using a static three-feature model based on XGBoost, with the following working steps:
Step 1, stage S01, obtaining the data set: a raw data set is first obtained from the VirusShare website, the selected data set "VirusShare_00271" containing 65,536 samples in total; because the invention targets PE files on the Windows platform, the raw data must be screened: non-PE files are screened out using the Exeinfo PE tool and the command-line file command, and samples whose family classification is unclear are then removed, finally yielding 2,798 samples from 182 different families; the method then enters stage S02;
Step 2, stage S02, extracting the feature vectors of the three feature groups: the byte-view, assembly-view, and PE-view features of each sample are obtained separately; the method then enters stage S03;
Step 3, stage S03, enumerating feature-group combinations and merging sample feature matrices: the sample feature matrices corresponding to the selected feature groups are merged into a single feature matrix; the three feature groups (byte view, assembly view, PE view) are numbered 0 to 2, so the selection or non-selection of the three groups in each combination can be represented by a feature-group selection sequence {I0, ..., Ii, ..., I2}, Ii ∈ {0, 1}, where Ii = 1 means feature group i is selected and Ii = 0 means it is not;
for each combination that contains more than one feature group, the sample feature matrices of the different feature groups must be merged, the sample feature matrix of one feature group being merged into the total sample feature matrix at a time; the method then enters stage S04;
Step 4, stage S04, training the XGBoost classification models: the training objective of each XGBoost classification model is to learn multiple regression trees such that the objective function

$$\mathrm{Obj} = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{t} \Omega(f_t)$$

is minimized, where $N$ is the number of software samples in the training set, $y_i$ is the predicted-family class label of training sample $i$ (0 indicates an incorrect prediction, 1 a correct one), $\hat{y}_i$ is the model's predicted value for software sample $i$, $l$ is the loss function, and $\Omega(f_t)$ is the complexity of the $t$-th regression tree; the regression trees are trained iteratively, and each time a regression tree is trained, the total predicted value of the current overall classification model for each software sample is updated; a regression tree is generated by level-wise splitting, which keeps extending the depth of the tree; once the tree has been generated, it is pruned to reduce its complexity; after pruning, the current predicted value of each leaf node for the software samples falling into it is computed; the method then enters stage S05;
Step 5, stage S05, generating the final classification result: the test-set software samples are input into the overall classification model trained by the XGBoost-based multi-feature-group model fusion algorithm; for each software sample, the same method as in the training stage is first used to obtain the sample feature vectors corresponding to the 7 feature-group combinations (with the three feature groups used here, excluding the case in which no group is selected, there are 2^3 - 1 = 7 different combinations); the outputs of the 7 XGBoost classification models are assembled into a new feature vector, which is input into a logistic regression classifier to obtain the probability pi that the software sample belongs to the malware; when pi > 0.5, the software sample is judged to be of the currently predicted family type.
2. The malware classification method using a static three-feature model based on XGBoost according to claim 1, characterized in that the screening of dynamic features is omitted in step 2, which ensures the safety of the analysis system.
3. The malware classification method using a static three-feature model based on XGBoost according to claim 1, characterized in that the two feature matrices in step 3 are merged as follows:
a. record the feature-vector size of feature matrix 1 as offset; for a sparse feature matrix, this size is the largest feature index in feature matrix 1 plus 1;
b. enumerate the feature vectors of each sample in the two feature matrices to be merged;
c. copy the feature vector from matrix 1 directly into the merged feature vector;
d. add offset to the feature index of every feature in the feature vector from matrix 2, and merge the result into the merged feature vector.
4. The malware classification method using a static three-feature model based on XGBoost according to claim 1, characterized in that in step 4 node splitting selects the optimal attribute, as in decision-tree construction, and when the splitting process is a complete split, no pruning is performed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811597864.0A CN109858247A (en) | 2018-12-26 | 2018-12-26 | Malware classification method using a static three-feature model based on XGBoost |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811597864.0A CN109858247A (en) | 2018-12-26 | 2018-12-26 | Malware classification method using a static three-feature model based on XGBoost |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109858247A true CN109858247A (en) | 2019-06-07 |
Family
ID=66892372
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811597864.0A Withdrawn CN109858247A (en) | 2018-12-26 | 2018-12-26 | Malware classification method using a static three-feature model based on XGBoost |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109858247A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112818344A (en) * | 2020-08-17 | 2021-05-18 | 北京辰信领创信息技术有限公司 | Method for improving virus killing rate by applying artificial intelligence algorithm |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107092827A (en) * | 2017-03-30 | 2017-08-25 | 中国民航大学 | A kind of Android malware detection method based on improvement forest algorithm |
US20180121652A1 (en) * | 2016-10-12 | 2018-05-03 | Sichuan University | Kind of malicious software clustering method expressed based on tlsh feature |
Non-Patent Citations (1)
Title |
---|
Sun Bowen et al.: "Malware classification method based on static multi-feature fusion", Chinese Journal of Network and Information Security * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111309912B (en) | Text classification method, apparatus, computer device and storage medium | |
CN106909654B (en) | Multi-level classification system and method based on news text information | |
CN110348214B (en) | Method and system for detecting malicious codes | |
CN106484401B (en) | A kind of Automated Refactoring of object-oriented software | |
CN104462301A (en) | Network data processing method and device | |
CN110880019A (en) | Method for adaptively training target domain classification model through unsupervised domain | |
CN109683946B (en) | User comment recommendation method based on code cloning technology | |
CN111310191A (en) | Block chain intelligent contract vulnerability detection method based on deep learning | |
CN109684851A (en) | Evaluation of Software Quality, device, equipment and computer storage medium | |
CN113221960B (en) | Construction method and collection method of high-quality vulnerability data collection model | |
CN109740347A (en) | A kind of identification of the fragile hash function for smart machine firmware and crack method | |
KR101520671B1 (en) | System and method for analysis executable code based on similarity | |
CN109697361A (en) | A kind of wooden horse classification method based on Trojan characteristics | |
CN107357895A (en) | A kind of processing method of the text representation based on bag of words | |
CN106096413A (en) | A kind of malicious code detecting method based on multi-feature fusion and system | |
CN111460452B (en) | Android malicious software detection method based on frequency fingerprint extraction | |
CN110941829B (en) | Large-scale hardware Trojan horse library generation system and method based on generation countermeasure network | |
CN113722711A (en) | Data adding method based on big data security vulnerability mining and artificial intelligence system | |
CN106156107B (en) | Method for discovering news hotspots | |
CN109858247A (en) | Malware classification method using a static three-feature model based on XGBoost | |
CN114386511A (en) | Malicious software family classification method based on multi-dimensional feature fusion and model integration | |
CN109816038A (en) | A kind of Internet of Things firmware program classification method and its device | |
CN110825642B (en) | Software code line-level defect detection method based on deep learning | |
JP2013003611A (en) | Design verification method and program | |
CN115729825A (en) | Fuzzy test case generation method and device of industrial protocol and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| CB02 | Change of applicant information | Address after: 210012 Jiangsu Province Yuhuatai District Software Avenue 168, 3 buildings, 5 floors; Applicant after: Bozhi Safety Technology Co.,Ltd.; Address before: 210012 Jiangsu Province Yuhuatai District Software Avenue 168, 3 buildings, 5 floors; Applicant before: JIANGSU BOZHI SOFTWARE TECHNOLOGY Co.,Ltd. |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20190607 |