CN108287996A - Method for cleaning obfuscated features of malicious code

Info

Publication number: CN108287996A
Application number: CN201810013584.4A
Authority: CN
Inventors: 王栎汉; 宁振虎; 薛菲; 蔡永泉; 梁鹏
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2018-01-08
Filing date: 2018-01-08
Publication date: 2018-07-17

Abstract

The invention discloses a method for cleaning obfuscated features of malicious code, belonging to the field of machine learning and information security. The method comprises a feature selection method and an obfuscated-feature cleaning method, and improves the effectiveness of traditional malicious code feature extraction. Compared with traditional malicious code feature extraction methods, the invention extends the period over which a feature extraction algorithm remains effective and improves its resistance to interference. The invention first builds a feature library using n-gram feature extraction. Because n-gram extraction cannot cope with the obfuscation operations applied to malicious code, the feature library contains a large number of obfuscated feature values. The obfuscated-feature cleaning algorithm eliminates the interference of this abnormal data with the recognition rules of the model. On this basis, a feature selection method is proposed from the perspective of training dataset size. It reduces the number of features the model finally uses while ensuring that the recognition accuracy of the model does not decline.

Description

Method for cleaning obfuscated features of malicious code

Technical field

The present invention relates to a method for removing obfuscated features of malicious code, which extends the period over which traditional malicious code feature extraction methods remain effective. It belongs to the field of machine learning and information security, and involves the combined use of machine learning classification algorithms, obfuscated-feature removal, and feature selection algorithms.

Background art

According to Symantec statistics, most emerging malicious code is generated from existing malicious code through a series of transformation operations. Malicious code detection is therefore usually based on feature vectors, which capture the essential characteristics of the malicious code. A good feature extraction algorithm is the core technology of malicious code variant detection. Common antivirus software usually identifies malicious code with signature-based methods. Given a group of malicious code samples, the samples are first labeled as belonging to a family. Malicious code of the same family should share the same features. These common features are extracted to build a feature library, which is then used to detect variants of that family. However, the security of a detection system based on such a feature library depends on the effectiveness of the feature extraction method used, because new variants can interfere with a known feature extraction method and thereby bypass the detection system. For example, in a detection system based on key strings, malicious code can escape identification by replacing key strings with equivalents or by adding useless strings. In response to the obfuscation operations that malicious code applies against feature extraction, many researchers have proposed different feature extraction methods to eliminate the influence of obfuscation on the detection system and obtain the best detection results. However, on the one hand these feature extraction methods are gradually defeated by malicious code; on the other hand, more secure feature extraction methods lead to problems such as excessive computing resource overhead and poor real-time performance.

Current research on malicious code detection concentrates on the extraction of malicious code feature vectors. To improve the interference resistance of these feature vectors, researchers have extracted features from several perspectives, including security attributes, dependencies, and actual semantics. Kirda et al. detect spyware through the behavioral feature that spyware obtains sensitive user data and then leaks it. This method, however, is limited to spyware and cannot detect malicious code that does not leak data. Wang Rui et al. start from the actual semantics of malicious code, build a behavior feature graph based on those semantics, and compute feature values from the graph to detect malicious code, achieving very good detection results. This method, however, is based on the behavior of the program itself and does not consider how the program calls resources, so it cannot identify certain special variants well. It also has poor timeliness and requires considerable computing resources, and is therefore not practical.

Moreover, as malicious code iterates faster, the period over which a feature extraction method remains effective becomes shorter and shorter. Maintaining the security of a system by replacing the feature extraction method becomes increasingly difficult. How to effectively eliminate the obfuscated features in a feature library has therefore become a problem of great practical significance.

Summary of the invention

To solve the problem that traditional malicious code feature extraction methods remain effective only for a short time, the technical solution adopted by the invention consists of an obfuscated-feature cleaning algorithm and a feature selection method. The obfuscated-feature cleaning algorithm is a method for removing obfuscated features of malicious code, aimed at n-gram malicious code feature extraction. By analyzing a small number of malicious code samples, the algorithm can clean the features of the remaining malicious code in the sample library. The resulting feature library has a stable number of features and is not easily affected by obfuscation. The feature selection method automatically replaces the features selected from the feature library according to the training sample set, thereby optimizing the feature library.

The method for cleaning obfuscated features of malicious code comprises a feature selection method and an obfuscated-feature cleaning method, and improves the effectiveness of traditional malicious code feature extraction. Compared with traditional malicious code feature extraction methods, the invention extends the period over which a feature extraction algorithm remains effective and improves its resistance to interference.

The invention first builds a feature library using n-gram feature extraction. Because n-gram extraction cannot cope with the obfuscation operations applied to malicious code, the feature library contains a large number of obfuscated feature values. The obfuscated-feature cleaning algorithm eliminates the interference of this abnormal data with the recognition rules of the model. On this basis, a feature selection method is proposed from the perspective of training dataset size. It reduces the number of features the model finally uses while ensuring that the recognition accuracy of the model does not decline.
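As a rough illustration of how an n-gram feature library entry can be produced, the sketch below counts the contiguous n-grams in one sample. The opcode list, the function name `ngram_counts`, and the choice n = 2 are assumptions made for the example; the text does not state which tokens or which n the invention uses.

```python
from collections import Counter

def ngram_counts(tokens, n=3):
    """Count every contiguous n-gram in a sequence of tokens (e.g. opcodes)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

# Hypothetical opcode sequence from one disassembled sample.
sample = ["push", "mov", "call", "push", "mov", "call", "ret"]

# Each distinct 2-gram becomes a feature; its count is the feature value.
features = ngram_counts(sample, n=2)
```

Padding instructions inserted by an obfuscator would show up here as extra n-grams with unusual counts, which is the kind of value the cleaning step is meant to remove.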

The main technical route of the invention is as follows:

1) Build the obfuscated-feature cleaning method on the basis of multi-sample analysis. A small number of samples are analyzed in detail to find the characteristics of the obfuscated features in the samples, and a linear regression model is built.

2) Using the obfuscated-feature cleaning method, dynamically compute the threshold of obfuscated feature values in each remaining sample, and use that value to remove obfuscated features from the feature vectors of the remaining samples in the sample library.

3) Build the feature selection method according to the input training sample set. The method first normalizes the obtained feature vectors and then, according to the number of input training samples, dynamically removes the feature values that contribute little to the dataset.

The obfuscated-feature cleaning method of the invention is implemented in the following steps:

1) Malicious code samples are complex: the obfuscation method used by each sample varies, and the distribution of the feature values extracted from different samples also differs. The magnitude of the obfuscated values must therefore be solved dynamically for each sample. The threshold ξ of the obfuscated feature values in a malicious code sample is called the obfuscation threshold; ξ is the smallest of the obfuscated feature values in the sample, and this minimum varies from sample to sample. To measure and characterize this value, two indices are defined: the feature expected value Feature_averages and the feature standard value Feature_median. Both are obtained by solving dynamically for a single sample and describe the feature distribution in that sample. The following function expresses the relationship between the threshold and the two indices: ξ = α * Feature_averages + β * Feature_median, where α and β are the weights of the feature expected value and the feature standard value respectively.

2) The feature expected value Feature_averages represents the ideal feature value of the sample in its original, unobfuscated state. Summing the feature values in the sample and taking their mean gives an ideal feature value for the current sample distribution. When n-gram extraction is applied to most malicious code samples, the sample ends up containing a large number of invalid features that occur only once. Therefore, when computing Feature_averages, the feature values in the sample are first deduplicated and then averaged. This removes the influence of the large amount of noise data on the mean. m is the number of features remaining after deduplication, and feature_i is the value of the i-th feature.

The feature expected value is calculated as:

Feature_averages = (feature₁ + feature₂ + ... + feature_m) / m

3) The feature standard value Feature_median reduces the interference of large obfuscated feature values with the final result. It is obtained by computing the median of all feature values in the sample and better reflects the ideal feature value of the sample when it is not disturbed. In a malicious code sample the overall distribution of feature values tends toward a Gaussian distribution, and obfuscated features make up only a very small share of it. Obfuscated feature values do affect the feature standard value, but because their share of the distribution is low, solving for the median of the distribution yields a value very close to the ideal range of feature values after obfuscation is removed. m is the number of features remaining after deduplication, and feature_i is the value of the i-th feature. The function mid returns the median of a sequence. The feature standard value is calculated as:

Feature_median = mid(feature₁, feature₂, ..., feature_m)
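The three steps above can be sketched as follows. This is an illustrative reading, not the patented implementation: the function names, the example counts, and the weights α = β = 0.5 are assumptions, and deduplication is interpreted as taking the set of distinct feature values.

```python
from statistics import mean, median

def feature_averages(values):
    """Mean of the distinct feature values (duplicates removed first)."""
    return mean(set(values))

def feature_median(values):
    """Median of the distinct feature values."""
    return median(set(values))

def obfuscation_threshold(values, alpha, beta):
    """xi = alpha * Feature_averages + beta * Feature_median."""
    return alpha * feature_averages(values) + beta * feature_median(values)

# Hypothetical n-gram counts for one sample: many one-off features and
# a single very large value such as obfuscation padding might produce.
counts = [1, 1, 1, 1, 2, 3, 4, 900]

# alpha and beta are placeholder weights chosen only for the example.
xi = obfuscation_threshold(counts, alpha=0.5, beta=0.5)
```

With these numbers the distinct values are 1, 2, 3, 4, and 900, so the mean is 182 and the median is 3. The median pulls the threshold well below the mean, which is the damping role the text assigns to Feature_median.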

When features are extracted from a malicious code sample set, the obfuscated-feature cleaning method yields a preliminarily processed, obfuscation-cleaned feature library. The obfuscated feature values that would most interfere with the trained model have been removed from this library, but training a model directly on it still does not give good results. Because the sample set contains variants from many families, the number of features in the library is very large. Among the features with small values, most are noise data, but some are important family features of the malicious code. These family features occur only a few times, so removing all features with small values would inevitably remove some good features and harm the accuracy of the model. The obfuscation-cleaned feature library is therefore cleaned further, so that important malicious code family features are retained while most of the noise data is eliminated.

This is done with a feature selection method based on the size of the input training dataset. The method is implemented as follows:

1) Because malicious code samples are diverse, the value range of the feature vector differs from sample to sample. The same numerical feature value has a different importance in different samples. To eliminate the influence of differing value ranges on the final weighing of features, the method uses a proportion-based normalization. For a single sample, the importance of each feature value is measured by the ratio of that feature value to the sum of all feature values in the sample. feature_i' denotes the new value of feature_i after normalization. The normalization is:

feature_i' = feature_i / Σ_j feature_j, where j runs over all features of the sample

2) In the normalized training feature library, the feature values of a single sample sum to 1, so for a total of S input samples the sum of all feature values is S. To eliminate the noise data in individual samples without destroying important family features, the method proposes a feature selection method based on the number of input samples S and the number n of malicious code family classes in the training set. After the feature vector of each sample in the obfuscation-cleaned feature library has been normalized, every feature that occurs is accumulated over the samples, giving a summed value for each feature over the whole sample set. Because family features recur in samples of the same family, accumulation increases their final values. The remaining noise data occurs only in individual samples and has the value 0 in all other samples, so its final accumulated value makes up a correspondingly smaller share of the features of the whole sample set. The value of a feature Feature_i is the sum of the values of that feature over all sample files, where Feature_i is the final value of the i-th feature, S is the number of samples in the training set, and feature_i is the value of that feature in each sample.

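Feature selection formula:

Feature_i = Σ_{s=1..S} feature_i(s), where feature_i(s) is the normalized value of the i-th feature in the s-th sample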
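A minimal sketch of the normalization and accumulation described above is given below. The function names and sample data are hypothetical. The `select` rule and its `cutoff` argument are assumptions added only to show how the accumulated values could be used; the text says the selection depends on S and n but does not give the exact criterion.

```python
from collections import defaultdict

def normalize(sample):
    """Scale one sample's feature values so that they sum to 1."""
    total = sum(sample.values())
    if total == 0:
        return {}
    return {name: value / total for name, value in sample.items()}

def accumulate(samples):
    """Sum each feature's normalized value over all samples."""
    totals = defaultdict(float)
    for sample in samples:
        for name, value in normalize(sample).items():
            totals[name] += value
    return dict(totals)

def select(totals, cutoff):
    """Keep features whose accumulated value reaches the cutoff (assumed rule)."""
    return {name for name, value in totals.items() if value >= cutoff}

# Three hypothetical samples of one family: the shared feature recurs,
# while each noise feature appears in a single sample only.
samples = [
    {"family_gram": 6, "noise_a": 2},
    {"family_gram": 3, "noise_b": 1},
    {"family_gram": 9, "noise_c": 3},
]
totals = accumulate(samples)
kept = select(totals, cutoff=1.0)
```

The accumulated values sum to S = 3, as the text states. The recurring family feature reaches 2.25 while each one-off noise feature stays at 0.25, so even a crude cutoff separates them in this toy case.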

Description of the drawings

Fig. 1: Overall framework of the model

Fig. 2: Disassembled program fragment

Fig. 3: Random forest algorithm model

Fig. 4: Malicious code sample test set

Fig. 5: F2 experimental comparison

Fig. 6: E1 experimental comparison

Detailed description of the embodiments

The invention is explained and illustrated below with reference to the drawings.

To make the purpose, technical solution, and features of the invention clearer, the invention is described in further detail below with reference to a specific embodiment and the drawings. The overall framework of the method is shown in Fig. 1. The steps are as follows:

(1) Extract the original malicious code features with the n-gram algorithm and build the initial feature library.

(2) Randomly draw samples and study their obfuscation thresholds. Train a linear fitting equation for predicting the obfuscation threshold of unknown samples.

(3) Use the obfuscated-feature cleaning method to clean the obfuscated features in the feature library.

(4) Normalize the feature library.

(5) Build the training feature library with the feature selection algorithm.
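Step (2) can be illustrated with an ordinary least-squares fit of the weights α and β in ξ = α * Feature_averages + β * Feature_median. The sketch assumes a model without an intercept and assumes that a threshold has been determined by hand for each analyzed sample. The function name and the three data rows are made up for the example.

```python
def fit_weights(rows):
    """Least-squares fit of xi = alpha * avg + beta * med (no intercept).

    rows is a list of (avg, med, xi) triples from manually analyzed samples.
    The 2x2 normal equations are solved directly with Cramer's rule.
    """
    s_aa = sum(a * a for a, _, _ in rows)
    s_am = sum(a * m for a, m, _ in rows)
    s_mm = sum(m * m for _, m, _ in rows)
    s_ax = sum(a * x for a, _, x in rows)
    s_mx = sum(m * x for _, m, x in rows)
    det = s_aa * s_mm - s_am * s_am
    if det == 0:
        raise ValueError("samples do not determine alpha and beta uniquely")
    alpha = (s_ax * s_mm - s_mx * s_am) / det
    beta = (s_mx * s_aa - s_ax * s_am) / det
    return alpha, beta

# Hypothetical (Feature_averages, Feature_median, xi) triples.
rows = [(10, 4, 32), (20, 5, 55), (15, 9, 57)]
alpha, beta = fit_weights(rows)

# Predict the threshold of an unseen sample from its two statistics.
predicted_xi = alpha * 12 + beta * 6
```

The example rows were generated with α = 2 and β = 3, so the fit recovers those weights. Real thresholds would be noisy and the fit would only approximate them.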

The running environment of the method is as follows:

The hardware is an IBM server (two quad-core Intel Xeon E5420 2.5 GHz/EM64T processors, 12 MB L2 cache) with 32 GB of PC2-5300 CL5 ECC DDR2 667 MHz memory and two 500 GB hard disks. The operating system is 64-bit CentOS 7.0 and the backend database is MongoDB 2.2.6. Malicious code mapping, blocking, feature extraction, retrieval, and filtering are implemented in Python; the distribution used is Anaconda-1.8.0-Linux-x86_64, which includes the packages the project relies on. MongoDB stores information about the malicious code samples, such as MD5 value, file size, family label, and malicious code PE block information. The malicious code library used is the one provided by Microsoft. It contains 9 well-known malicious code families. Every sample has a unique ID and a unique class value from 1 to 9 that corresponds to its family, and there are 10868 samples in total. The malicious code test sample set is described in Fig. 2.

The obfuscated-feature cleaning method is the core of the invention for extending the period over which a feature extraction method remains effective. To verify its effectiveness, the invention uses the random forest algorithm to test the cleaning method. Because malicious code variants are complex, it is difficult to completely remove all invalid features from the final feature library. The sample set is therefore sampled with replacement to build multiple training sample sets for model training, which improves the generalization and diversity of the dataset and the interference resistance and robustness of the final detection system. The random forest algorithm builds multiple classifiers, and the final detection result is determined by a vote among them. Because the model draws malicious code samples by random sampling with replacement, a sample set of the same size as the original data will still contain repeated samples and miss others. The differences this sampling creates between sample sets make maximal use of the training samples and improve the generalization of the model to future variants. The random forest model is shown in Fig. 3.
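The two mechanisms this paragraph relies on, sampling with replacement and voting among classifiers, can be sketched with the standard library alone. This is not a random forest implementation: the classifiers themselves are omitted, and the training set, seed, and vote labels are placeholders.

```python
import random
from collections import Counter

def bootstrap(samples, rng):
    """Draw len(samples) items with replacement; some repeat, some are missed."""
    return [rng.choice(samples) for _ in samples]

def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)            # fixed seed only to make the sketch repeatable
training_set = list(range(10))    # stand-ins for ten labeled samples

# One resampled training set per classifier in the ensemble.
resamples = [bootstrap(training_set, rng) for _ in range(3)]

# Votes of three hypothetical classifiers for one test sample (family labels).
label = majority_vote([3, 3, 7])
```

Each resample has the same size as the original set but generally a different composition, which is the source of the diversity between classifiers that the text describes.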

Because random forests draw samples at random, when a large number of repeated experiments are run, a training feature library containing many obfuscated features causes the final detection accuracy to fluctuate widely. For a malicious code detection model, large fluctuations in accuracy indicate that the model is unstable and has weaknesses that can be exploited. To examine the influence of different obfuscated-value cleaning schemes on the final training features, the invention tests the fluctuation of accuracy under the different schemes. The experimental accuracy and sample numbers of F2, E1, and E2 are shown in Figs. 4, 5, and 6. F2 is the obfuscation cleaning scheme proposed by the invention; the other two are obfuscated-feature removal methods commonly used in security research. The cleaning schemes are listed in Table 1.

To measure how much the detection accuracy of each test scheme in Figs. 4, 5, and 6 fluctuates across the different datasets, the invention uses the standard deviation of the final model accuracy of each scheme. The standard deviations of the experimental schemes are given in Table 2. As Table 2 shows, the overall standard deviation of the F2 scheme is small and decreases regularly as the input sample set grows. Compared with the other schemes, F2 is already fairly stable with an input set of 1000 samples, and when the input reaches 5000 samples its model accuracy fluctuates the least and is the most stable. It should be noted that the E2 scheme has the smallest standard deviation when the input reaches 2000 samples. First, this experiment computes a single observation, so some deviation is to be expected. Second, the standard deviation of the E2 scheme drops sharply from 1000 to 2000 input samples, but when the input reaches 5000 samples it decreases only slightly and does not match that of the F2 scheme. The F2 scheme is therefore the best obfuscated-value cleaning scheme. It solves the problem that, because the model uses a feature extraction method that has lost its effectiveness, obfuscated feature values make up a large share of the feature library and strongly affect the final detection result of the model.

Table 1

Testing scheme	Test sample set	Feature selection scheme	Cleaning condition	Obfuscated-value cleaning method
E1	1000, 2000, 5000	C1	feature_i > 300	
E2	1000, 2000, 5000	C1	feature_i > 500	feature_i = 0
F1	1000, 2000, 5000	C1	feature_i > ξ	feature_i = ξ
F2	1000, 2000, 5000	C1	feature_i > ξ	
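Each row of Table 1 pairs a cleaning condition with a replacement value. The sketch below applies the two rows whose replacement value is stated, E2 and F1, to one vector of feature values. The function name, the example counts, and the threshold value 50 are assumptions for illustration; the E1 and F2 rows are left out because their replacement values are not given.

```python
def clean(values, limit, replacement):
    """Replace every feature value above `limit` with `replacement`."""
    return [replacement if value > limit else value for value in values]

counts = [1, 2, 40, 700]   # hypothetical feature values of one sample
xi = 50                    # hypothetical per-sample obfuscation threshold

# E2: fixed limit of 500, offending values are zeroed.
e2 = clean(counts, limit=500, replacement=0)

# F1: dynamic limit xi, offending values are clipped to xi.
f1 = clean(counts, limit=xi, replacement=xi)
```

A fixed limit treats every sample alike, whereas the ξ-based rows adapt the limit to each sample's own distribution.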

Table 2
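The stability comparison described in the text can be reproduced in outline by computing the standard deviation of accuracy over repeated runs of each scheme. The accuracy values below are invented purely to show the calculation and are not results from the experiments. The use of the sample standard deviation (rather than the population form) is also an assumption.

```python
from statistics import stdev

def accuracy_spread(accuracies):
    """Sample standard deviation of the accuracies from repeated runs."""
    return stdev(accuracies)

# Made-up accuracies from three repeated runs of two hypothetical schemes.
runs = {
    "scheme_a": [0.90, 0.92, 0.94],
    "scheme_b": [0.80, 0.92, 0.98],
}

spread = {name: accuracy_spread(acc) for name, acc in runs.items()}

# The scheme with the smallest spread is the most stable one.
most_stable = min(spread, key=spread.get)
```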

Claims

1. A method for cleaning obfuscated features of malicious code, the method comprising a feature selection method and an obfuscated-feature cleaning method and improving the effectiveness of traditional malicious code feature extraction;

a feature library is first built using n-gram feature extraction; because n-gram extraction cannot cope with the obfuscation operations applied to malicious code, the feature library contains a large number of obfuscated feature values; the obfuscated-feature cleaning algorithm eliminates the interference of this abnormal data with the recognition rules of the model; on this basis, a feature selection method is proposed from the perspective of training dataset size; the feature selection method reduces the number of features the model finally uses while ensuring that the recognition accuracy of the model does not decline;

characterized in that the method is implemented as follows:

1) the obfuscated-feature cleaning method is built on the basis of multi-sample analysis; a small number of samples are analyzed in detail to find the characteristics of the obfuscated features in the samples, and a linear regression model is built;

2) using the obfuscated-feature cleaning method, the threshold of obfuscated feature values in each remaining sample is computed dynamically, and that value is used to remove obfuscated features from the feature vectors of the remaining samples in the sample library;

3) the feature selection method is built according to the input training sample set; the method first normalizes the obtained feature vectors and then, according to the number of input training samples, dynamically removes the feature values that contribute little to the dataset;

the specific implementation steps are as follows:

1) malicious code samples are complex: the obfuscation method used by each sample varies, and the distribution of the feature values extracted from different samples also differs; the magnitude of the obfuscated values must therefore be solved dynamically for each sample; the threshold ξ of the obfuscated feature values in a malicious code sample is called the obfuscation threshold; ξ is the smallest of the obfuscated feature values in the sample, and this minimum varies from sample to sample; to measure and characterize this value, two indices are defined: the feature expected value Feature_averages and the feature standard value Feature_median; both are obtained by solving dynamically for a single sample and describe the feature distribution in that sample; the following function expresses the relationship between the threshold and the two indices: ξ = α * Feature_averages + β * Feature_median, where α and β are the weights of the feature expected value and the feature standard value respectively;

2) the feature expected value Feature_averages represents the ideal feature value of the sample in its original, unobfuscated state; summing the feature values in the sample and taking their mean gives an ideal feature value for the current sample distribution; when n-gram extraction is applied to most malicious code samples, the sample ends up containing a large number of invalid features that occur only once; therefore, when computing Feature_averages, the feature values in the sample are first deduplicated and then averaged; this removes the influence of the large amount of noise data on the mean; m is the number of features remaining after deduplication, and feature_i is the value of the i-th feature;

the feature expected value is calculated as:

Feature_averages = (feature₁ + feature₂ + ... + feature_m) / m;

3) the feature standard value Feature_median reduces the interference of large obfuscated feature values with the final result; it is obtained by computing the median of all feature values in the sample and better reflects the ideal feature value of the sample when it is not disturbed; in a malicious code sample the overall distribution of feature values tends toward a Gaussian distribution, and obfuscated features make up only a very small share of it; obfuscated feature values do affect the feature standard value, but because their share of the distribution is low, solving for the median of the distribution yields a value very close to the ideal range of feature values after obfuscation is removed; m is the number of features remaining after deduplication, and feature_i is the value of the i-th feature; the function mid returns the median of a sequence; the feature standard value is calculated as:

Feature_median = mid(feature₁, feature₂, ..., feature_m).

2. The method for cleaning obfuscated features of malicious code according to claim 1, characterized in that: when features are extracted from a malicious code sample set, the obfuscated-feature cleaning method yields a preliminarily processed, obfuscation-cleaned feature library; the obfuscated feature values that would most interfere with the trained model have been removed from this library, but training a model directly on it still does not give good results; because the sample set contains variants from many families, the number of features in the library is very large; among the features with small values, most are noise data, but some are important family features of the malicious code; these family features occur only a few times, so removing all features with small values would inevitably remove some good features and harm the accuracy of the model; the obfuscation-cleaned feature library is therefore cleaned further, so that important malicious code family features are retained while most of the noise data is eliminated;

this is achieved with a feature selection method based on the size of the input training dataset, the specific technical solution being as follows:

1) because malicious code samples are diverse, the value range of the feature vector differs from sample to sample; the same numerical feature value has a different importance in different samples; to eliminate the influence of differing value ranges on the final weighing of features, the method uses a proportion-based normalization; for a single sample, the importance of each feature value is measured by the ratio of that feature value to the sum of all feature values in the sample; feature_i' denotes the new value of feature_i after normalization; the normalization is:

feature_i' = feature_i / Σ_j feature_j, where j runs over all features of the sample;

2) in the normalized training feature library, the feature values of a single sample sum to 1, so for a total of S input samples the sum of all feature values is S; to eliminate the noise data in individual samples without destroying important family features, the method proposes a feature selection method based on the number of input samples S and the number n of malicious code family classes in the training set; after the feature vector of each sample in the obfuscation-cleaned feature library has been normalized, every feature that occurs is accumulated over the samples, giving a summed value for each feature over the whole sample set; because family features recur in samples of the same family, accumulation increases their final values; the remaining noise data occurs only in individual samples and has the value 0 in all other samples, so its final accumulated value makes up a correspondingly smaller share of the features of the whole sample set; the value of a feature Feature_i is the sum of the values of that feature over all sample files, where Feature_i is the final value of the i-th feature, S is the number of samples in the training set, and feature_i is the value of that feature in each sample;

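feature selection formula:

Feature_i = Σ_{s=1..S} feature_i(s), where feature_i(s) is the normalized value of the i-th feature in the s-th sample.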