CN109858247A - A malware classification method using a static three-feature model based on XGBoost - Google Patents

Info

Publication number
CN109858247A
Authority
CN
China
Prior art keywords
sample
feature group
xgboost
feature
stage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201811597864.0A
Other languages
Chinese (zh)
Inventor
傅涛
王力
郑轶
张腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Bozhi Software Technology Co., Ltd.
Original Assignee
Jiangsu Bozhi Software Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-12-26
Filing date
2018-12-26
Publication date
2019-06-07
Application filed by Jiangsu Bozhi Software Technology Co., Ltd.

Abstract

This application relates to a malware classification method using a static three-feature model based on XGBoost. Existing static feature extraction techniques rely mainly on three views: the byte view (hexadecimal bytecode), the assembly view (asm assembly code), and the PE view (PE structure information). Using only one of these views is fast, but accuracy may be very low. Using all three improves accuracy effectively, but the time cost multiplies and the accuracy gained is still clearly insufficient by comparison. The invention introduces the machine learning algorithm XGBoost and performs algorithmic integration on the feature sets extracted from the three views, obtaining higher accuracy.

Description

A malware classification method using a static three-feature model based on XGBoost
Technical field
The present invention relates to a feature classification method, and in particular to a fused-feature classification method.
Background
Existing classifiers based on the XGBoost model either use too few static features or add dynamic features for fusion. The latter can damage the analysis system and is unsafe. In practice, what is needed is a malware classification model that is more efficient, safer, and relatively more accurate, and that does not damage the analysis system environment.
Summary of the invention
In view of the defects of the prior art, the invention proposes a malware classification method using a static three-feature model based on XGBoost. Compared with classifying malware on a single static feature, the three-feature XGBoost model algorithm improves classification accuracy; compared with classification methods that combine static and dynamic features, it improves efficiency and safety.
The technical solution adopted is as follows:
A malware classification method using a static three-feature model based on XGBoost, with the following working steps:
Step 1: Stage S01, obtaining the data set. A raw data set is first obtained from the VirusShare website; the "VirusShare_00271" data set, containing 65,536 samples in total, is selected. Because the invention targets PE files on the Windows platform, the raw data must be screened: non-PE files are filtered out using the Exeinfo PE tool and the command-line file command, and samples whose family classification is unclear are then removed, finally yielding 2,798 samples from 182 different families. The process then enters stage S02.
Step 2: Stage S02, extracting the feature vectors of the three feature groups. The byte-view, assembly-view, and PE-view features of each sample are obtained separately. The process then enters stage S03.
Step 3: Stage S03, enumerating feature-group combinations and merging sample feature matrices. Merging combines the sample feature matrices of the selected feature groups into a single feature matrix. The three feature groups (byte view, assembly view, and PE view) are numbered 0 to 2, so the selection of the three feature groups in each combination can be expressed as a feature-group selection sequence {I0, ..., Ii, ..., I2}, Ii ∈ {0, 1}, where Ii = 1 means feature group i is selected and Ii = 0 means it is not. For each feature-group combination that contains more than one feature group, the sample feature matrices of the different feature groups must be merged; the sample feature matrix of one feature group is merged into the total sample feature matrix at a time. The process then enters stage S04.
Step 4: Stage S04, XGBoost classification model training. The training objective of each XGBoost classification model is to learn multiple regression trees that minimize the objective function $Obj = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{t} \Omega(f_t)$, where N is the number of software samples in the training set; $y_i$ is the label of training sample i for the predicted family class (0 indicates a wrong prediction, 1 a correct one); $\hat{y}_i$ is the model's predicted value for software sample i; $l$ is the loss function; and $\Omega(f_t)$ is the complexity of the t-th regression tree $f_t$. The regression trees are trained iteratively: each time a tree is trained, the total predicted value of the current overall classification model for each software sample is updated. A regression tree is grown by level-wise splitting, which progressively extends its depth. After a tree has been generated, it is pruned to reduce its complexity; after pruning, the current predicted value of each leaf node for the software samples falling into it is computed. The process then enters stage S05.
Step 5: Stage S05, generating the final classification result. The test-set software samples are input to the overall classification model trained by the XGBoost-based multi-feature-group model fusion algorithm. Using the same method as in the training stage, the sample feature vectors for the 7 feature-group combinations are first obtained (with three feature groups, excluding the case in which no feature group is selected, there are 2^3 - 1 = 7 different combinations). The 7 XGBoost classification models then produce their outputs, which form a new feature vector that is input to a logistic regression classifier, yielding the probability pi that the software sample is malware. When pi > 0.5, the software sample is judged to belong to the family type currently being predicted.
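The fusion in step 5 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `stack_predict`, the weights, the bias, and the example scores are all hypothetical, and the logistic regression parameters are assumed to have been fitted elsewhere.

```python
import math

def stack_predict(model_scores, weights, bias, threshold=0.5):
    """Combine per-model scores with a logistic-regression layer.

    model_scores: one score per feature-group combination (7 in the source).
    weights, bias: hypothetical, already-fitted logistic-regression parameters.
    Returns (p, decision), where decision is True when p > threshold.
    """
    if len(model_scores) != len(weights):
        raise ValueError("one weight per model score is required")
    z = bias + sum(w * s for w, s in zip(weights, model_scores))
    p = 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function
    return p, p > threshold

# Illustrative values only: 7 scores, one per XGBoost model.
scores = [0.9, 0.8, 0.7, 0.95, 0.85, 0.6, 0.9]
p, is_family = stack_predict(scores, weights=[1.0] * 7, bias=-3.5)
```

The strict comparison mirrors the source's "pi > 0.5": a probability of exactly 0.5 is not assigned to the family.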
Step 2 omits the extraction of dynamic features, which guarantees the safety of the analysis system.
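The source does not say which concrete features make up each view. As one assumed example of a purely static byte-view feature, a normalized 256-bin byte histogram can be computed from the raw file bytes without ever executing the sample; the function name `byte_histogram` is hypothetical.

```python
from collections import Counter

def byte_histogram(data: bytes) -> list:
    """Return a 256-bin normalized byte-frequency vector for raw file bytes."""
    counts = Counter(data)  # iterating over bytes yields ints in 0..255
    total = len(data)
    if total == 0:
        return [0.0] * 256
    return [counts.get(b, 0) / total for b in range(256)]

# Example on the first eight bytes of a typical DOS header.
features = byte_histogram(b"MZ\x90\x00\x03\x00\x00\x00")
```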
The steps for merging two feature matrices in step 3 are:
a. Denote the feature-vector dimension of feature matrix 1 as offset. For a sparse feature matrix, this dimension is the largest feature index in feature matrix 1 plus 1.
b. Enumerate the feature vectors of each sample in the two feature matrices to be merged.
c. Copy the feature vector from matrix 1 directly into the merged feature vector.
d. Add offset to the feature index of each feature in the feature vector from matrix 2, and merge the result into the merged feature vector.
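Steps a to d can be sketched as below. The representation is an assumption: each matrix is a list with one dict per sample mapping feature index to value, and the function name `merge_feature_matrices` is hypothetical.

```python
def merge_feature_matrices(matrix1, matrix2, offset=None):
    """Merge two per-sample sparse feature matrices.

    Each matrix is a list with one dict per sample (feature index -> value);
    both lists must describe the same samples in the same order.
    """
    if len(matrix1) != len(matrix2):
        raise ValueError("both matrices must cover the same samples")
    if offset is None:
        # Step a (sparse case): largest feature index in matrix 1, plus 1.
        offset = max((max(row) for row in matrix1 if row), default=-1) + 1
    merged = []
    for row1, row2 in zip(matrix1, matrix2):  # step b: walk the samples
        out = dict(row1)                      # step c: copy matrix 1 as is
        for idx, value in row2.items():       # step d: shift matrix 2 indices
            out[idx + offset] = value
        merged.append(out)
    return merged, offset

m1 = [{0: 1.0, 2: 0.5}, {1: 2.0}]
m2 = [{0: 7.0}, {1: 3.0}]
merged, offset = merge_feature_matrices(m1, m2)
```

Shifting by offset keeps the index ranges of the two feature groups disjoint, so no feature from matrix 2 can overwrite one from matrix 1.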
In step 4, the optimal attribute is selected for node splitting during decision tree construction; if the splitting process splits completely, no pruning is needed.
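The source does not state how the optimal attribute is chosen. The sketch below assumes the standard XGBoost split gain computed from first- and second-order gradient statistics; the function names and the regularization defaults (`lam`, `gamma`) are illustrative assumptions.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Standard XGBoost gain of splitting one node into left and right children."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

def best_split(values, grads, hess, lam=1.0, gamma=0.0):
    """Scan thresholds on one feature; return (gain, threshold) or None."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    g_total, h_total = sum(grads), sum(hess)
    g_left = h_left = 0.0
    best = None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        g_left += grads[i]
        h_left += hess[i]
        if values[i] == values[nxt]:
            continue  # no threshold separates equal feature values
        gain = split_gain(g_left, h_left,
                          g_total - g_left, h_total - h_left, lam, gamma)
        if best is None or gain > best[0]:
            best = (gain, (values[i] + values[nxt]) / 2.0)
    return best

best = best_split([1, 2, 3, 4], [-1.0, -1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0])
```

Running the same scan over every attribute and keeping the largest gain gives the attribute to split on. Under this assumed criterion, a split whose gain is not positive adds complexity without reducing the objective, which is the usual reason to prune it.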
Beneficial effects:
Compared with the prior art, the advantage of the invention is that it ignores dynamic features and uses only static features, which increases the safety of the analysis system and improves efficiency. The XGBoost model used achieves better classification precision than traditional classification methods, improving the accuracy of malware classification.
Brief description of the drawings
Fig. 1 is the flow diagram of the invention.
Detailed description of the embodiments:
The invention is further elaborated below with reference to Fig. 1:
A malware classification method using a static three-feature model based on XGBoost, characterized in that it comprises the following steps:
Step 1: Stage S01, obtaining the data set. A raw data set is first obtained from the VirusShare website; the "VirusShare_00271" data set, containing 65,536 samples in total, is selected. Because the invention targets PE files on the Windows platform, the raw data must be screened: non-PE files are filtered out using the Exeinfo PE tool and the command-line file command, and samples whose family classification is unclear are then removed, finally yielding 2,798 samples from 182 different families. The process then enters stage S02.
Step 2: Stage S02, extracting the feature vectors of the three feature groups. The byte-view, assembly-view, and PE-view features of each sample are obtained separately. The process then enters stage S03.
Step 3: Stage S03, enumerating feature-group combinations and merging sample feature matrices. Merging combines the sample feature matrices of the selected feature groups into a single feature matrix. The three feature groups (byte view, assembly view, and PE view) are numbered 0 to 2, so the selection of the three feature groups in each combination can be expressed as a feature-group selection sequence {I0, ..., Ii, ..., I2}, Ii ∈ {0, 1}, where Ii = 1 means feature group i is selected and Ii = 0 means it is not. For each feature-group combination that contains more than one feature group, the sample feature matrices of the different feature groups must be merged; the sample feature matrix of one feature group is merged into the total sample feature matrix at a time. The process then enters stage S04.
Step 4: Stage S04, XGBoost classification model training. The training objective of each XGBoost classification model is to learn multiple regression trees that minimize the objective function $Obj = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{t} \Omega(f_t)$, where N is the number of software samples in the training set; $y_i$ is the label of training sample i for the predicted family class (0 indicates a wrong prediction, 1 a correct one); $\hat{y}_i$ is the model's predicted value for software sample i; $l$ is the loss function; and $\Omega(f_t)$ is the complexity of the t-th regression tree $f_t$. The regression trees are trained iteratively: each time a tree is trained, the total predicted value of the current overall classification model for each software sample is updated. A regression tree is grown by level-wise splitting, which progressively extends its depth. After a tree has been generated, it is pruned to reduce its complexity; after pruning, the current predicted value of each leaf node for the software samples falling into it is computed. The process then enters stage S05.
Step 5: Stage S05, generating the final classification result. The test-set software samples are input to the overall classification model trained by the XGBoost-based multi-feature-group model fusion algorithm. Using the same method as in the training stage, the sample feature vectors for the 7 feature-group combinations are first obtained (with three feature groups, excluding the case in which no feature group is selected, there are 2^3 - 1 = 7 different combinations). The 7 XGBoost classification models then produce their outputs, which form a new feature vector that is input to a logistic regression classifier, yielding the probability pi that the software sample is malware. When pi > 0.5, the software sample is judged to belong to the family type currently being predicted.
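The count of 7 combinations follows from enumerating every selection sequence over the three feature groups and dropping the all-zero one. A minimal sketch, in which the group names and the function name `enumerate_combinations` are assumptions for illustration:

```python
from itertools import product

# Feature groups numbered 0..2, as in step 3 (names are illustrative).
FEATURE_GROUPS = ("byte_view", "assembly_view", "pe_view")

def enumerate_combinations(n_groups=3):
    """Return every selection sequence (I0, ..., I_{n-1}) except all zeros."""
    return [seq for seq in product((0, 1), repeat=n_groups) if any(seq)]

combos = enumerate_combinations()
# Translate each selection sequence into the names of the selected groups.
named = [[FEATURE_GROUPS[i] for i, flag in enumerate(seq) if flag]
         for seq in combos]
```

Each entry of `combos` corresponds to one XGBoost model in the source's scheme, trained on the merged feature matrix of the groups it selects.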
Step 2 omits the extraction of dynamic features, which guarantees the safety of the analysis system.
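The same static principle applies to the screening in step 1: a file can be recognized as PE from its headers alone, without running it. The source performs this screening with the Exeinfo PE tool and the file command; the sketch below is a substitute header check, not a reproduction of either tool, and the function name `looks_like_pe` is hypothetical.

```python
import struct

def looks_like_pe(data: bytes) -> bool:
    """Header-only check: DOS 'MZ' magic, then the PE signature at e_lfanew."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # e_lfanew: 4-byte little-endian offset of the PE header, stored at 0x3C.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    return data[e_lfanew:e_lfanew + 4] == b"PE\x00\x00"

# Smallest synthetic header that satisfies the check (not a runnable program).
stub = b"MZ" + b"\x00" * 0x3A + struct.pack("<I", 0x40) + b"PE\x00\x00"
is_pe = looks_like_pe(stub)
```

A check like this only rejects files that are clearly not PE; a malformed or truncated PE file may still pass it, so it is a pre-filter at most.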
The steps for merging two feature matrices in step 3 are:
a. Denote the feature-vector dimension of feature matrix 1 as offset. For a sparse feature matrix, this dimension is the largest feature index in feature matrix 1 plus 1.
b. Enumerate the feature vectors of each sample in the two feature matrices to be merged.
c. Copy the feature vector from matrix 1 directly into the merged feature vector.
d. Add offset to the feature index of each feature in the feature vector from matrix 2, and merge the result into the merged feature vector.
In step 4, the optimal attribute is selected for node splitting during decision tree construction; if the splitting process splits completely, no pruning is needed.
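Step 4 refers to the complexity of each regression tree and to the predicted value of each leaf node without giving formulas. Assuming the method follows the standard XGBoost formulation (an assumption; the source does not state it), let T be the number of leaves of a tree, w_j the weight of leaf j, G_j and H_j the sums of the first- and second-order gradients of the loss over the samples falling into leaf j, and γ, λ regularization hyperparameters. The complexity term, the optimal leaf value, and the resulting objective contribution of the tree would then be:

```latex
\Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2},
\qquad
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
Obj^{*} = -\tfrac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j + \lambda} + \gamma T
```

Under this reading, the "current predicted value" of a leaf node is w_j*, and it is added to the running total prediction of every sample that falls into that leaf.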

Claims (4)

1. A malware classification method using a static three-feature model based on XGBoost, with the following working steps:
Step 1: Stage S01, obtaining the data set: a raw data set is first obtained from the VirusShare website; the "VirusShare_00271" data set, containing 65,536 samples in total, is selected; because the invention targets PE files on the Windows platform, the raw data must be screened: non-PE files are filtered out using the Exeinfo PE tool and the command-line file command, and samples whose family classification is unclear are then removed, finally yielding 2,798 samples from 182 different families; the process then enters stage S02;
Step 2: Stage S02, extracting the feature vectors of the three feature groups: the byte-view, assembly-view, and PE-view features of each sample are obtained separately; the process then enters stage S03;
Step 3: Stage S03, enumerating feature-group combinations and merging sample feature matrices: merging combines the sample feature matrices of the selected feature groups into a single feature matrix; the three feature groups (byte view, assembly view, and PE view) are numbered 0 to 2, so the selection of the three feature groups in each combination can be expressed as a feature-group selection sequence {I0, ..., Ii, ..., I2}, Ii ∈ {0, 1}, where Ii = 1 means feature group i is selected and Ii = 0 means it is not;
for each feature-group combination that contains more than one feature group, the sample feature matrices of the different feature groups must be merged, the sample feature matrix of one feature group being merged into the total sample feature matrix at a time; the process then enters stage S04;
Step 4: Stage S04, XGBoost classification model training: the training objective of each XGBoost classification model is to learn multiple regression trees that minimize the objective function $Obj = \sum_{i=1}^{N} l(y_i, \hat{y}_i) + \sum_{t} \Omega(f_t)$, where N is the number of software samples in the training set, $y_i$ is the label of training sample i for the predicted family class (0 indicates a wrong prediction, 1 a correct one), $\hat{y}_i$ is the model's predicted value for software sample i, $l$ is the loss function, and $\Omega(f_t)$ is the complexity of the t-th regression tree $f_t$; the regression trees are trained iteratively, and each time a tree is trained, the total predicted value of the current overall classification model for each software sample is updated; a regression tree is grown by level-wise splitting, which progressively extends its depth; after a tree has been generated, it is pruned to reduce its complexity; after pruning, the current predicted value of each leaf node for the software samples falling into it is computed; the process then enters stage S05;
Step 5: Stage S05, generating the final classification result: the test-set software samples are input to the overall classification model trained by the XGBoost-based multi-feature-group model fusion algorithm; using the same method as in the training stage, the sample feature vectors for the 7 feature-group combinations are first obtained (with three feature groups, excluding the case in which no feature group is selected, there are 2^3 - 1 = 7 different combinations); the 7 XGBoost classification models then produce their outputs, which form a new feature vector that is input to a logistic regression classifier, yielding the probability pi that the software sample is malware; when pi > 0.5, the software sample is judged to belong to the family type currently being predicted.
2. The malware classification method using a static three-feature model based on XGBoost according to claim 1, characterized in that step 2 omits the extraction of dynamic features, which guarantees the safety of the analysis system.
3. The malware classification method using a static three-feature model based on XGBoost according to claim 1, characterized in that the steps for merging two feature matrices in step 3 are:
a. denote the feature-vector dimension of feature matrix 1 as offset; for a sparse feature matrix, this dimension is the largest feature index in feature matrix 1 plus 1;
b. enumerate the feature vectors of each sample in the two feature matrices to be merged;
c. copy the feature vector from matrix 1 directly into the merged feature vector;
d. add offset to the feature index of each feature in the feature vector from matrix 2, and merge the result into the merged feature vector.
4. The malware classification method using a static three-feature model based on XGBoost according to claim 1, characterized in that in step 4 the optimal attribute is selected for node splitting during decision tree construction, and if the splitting process splits completely, no pruning is needed.
CN201811597864.0A 2018-12-26 2018-12-26 A kind of Malware classification method of three characteristic model of static state based on XGBoost Withdrawn CN109858247A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811597864.0A CN109858247A (en) 2018-12-26 2018-12-26 A kind of Malware classification method of three characteristic model of static state based on XGBoost

Publications (1)

Publication Number Publication Date
CN109858247A (en) 2019-06-07

Family

ID=66892372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811597864.0A Withdrawn CN109858247A (en) 2018-12-26 2018-12-26 A kind of Malware classification method of three characteristic model of static state based on XGBoost

Country Status (1)

Country Link
CN (1) CN109858247A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818344A (en) * 2020-08-17 2021-05-18 北京辰信领创信息技术有限公司 Method for improving virus killing rate by applying artificial intelligence algorithm

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107092827A (en) * 2017-03-30 2017-08-25 中国民航大学 A kind of Android malware detection method based on improvement forest algorithm
US20180121652A1 (en) * 2016-10-12 2018-05-03 Sichuan University Kind of malicious software clustering method expressed based on tlsh feature

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
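Sun Bowen et al.: "Malware classification method based on static multiple-feature fusion", Chinese Journal of Network and Information Security *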

Similar Documents

Publication Publication Date Title
CN111309912B (en) Text classification method, apparatus, computer device and storage medium
CN106909654B (en) Multi-level classification system and method based on news text information
CN110348214B (en) Method and system for detecting malicious codes
CN106484401B (en) A kind of Automated Refactoring of object-oriented software
CN104462301A (en) Network data processing method and device
CN110880019A (en) Method for adaptively training target domain classification model through unsupervised domain
CN109683946B (en) User comment recommendation method based on code cloning technology
CN111310191A (en) Block chain intelligent contract vulnerability detection method based on deep learning
CN109684851A (en) Evaluation of Software Quality, device, equipment and computer storage medium
CN113221960B (en) Construction method and collection method of high-quality vulnerability data collection model
CN109740347A (en) A kind of identification of the fragile hash function for smart machine firmware and crack method
KR101520671B1 (en) System and method for analysis executable code based on similarity
CN109697361A (en) A kind of wooden horse classification method based on Trojan characteristics
CN107357895A (en) A kind of processing method of the text representation based on bag of words
CN106096413A (en) A kind of malicious code detecting method based on multi-feature fusion and system
CN111460452B (en) Android malicious software detection method based on frequency fingerprint extraction
CN110941829B (en) Large-scale hardware Trojan horse library generation system and method based on generation countermeasure network
CN113722711A (en) Data adding method based on big data security vulnerability mining and artificial intelligence system
CN106156107B (en) Method for discovering news hotspots
CN109858247A (en) A kind of Malware classification method of three characteristic model of static state based on XGBoost
CN114386511A (en) Malicious software family classification method based on multi-dimensional feature fusion and model integration
CN109816038A (en) A kind of Internet of Things firmware program classification method and its device
CN110825642B (en) Software code line-level defect detection method based on deep learning
JP2013003611A (en) Design verification method and program
CN115729825A (en) Fuzzy test case generation method and device of industrial protocol and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
CB02 Change of applicant information

Address after: 210012 Jiangsu Province Yuhuatai District Software Avenue 168, 3 buildings, 5 floors

Applicant after: Bozhi Safety Technology Co.,Ltd.

Address before: 210012 Jiangsu Province Yuhuatai District Software Avenue 168, 3 buildings, 5 floors

Applicant before: JIANGSU BOZHI SOFTWARE TECHNOLOGY Co.,Ltd.

SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20190607