CN109784046A

CN109784046A - Malware detection method, apparatus, and electronic device

Info

Publication number: CN109784046A
Application number: CN201811495637.7A
Authority: CN
Inventors: 胡一博; 朱诗兵; 李长青; 帅海峰; 吕登龙; 徐华正; 张记瑞
Original assignee: Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Current assignee: Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Priority date: 2018-12-07
Filing date: 2018-12-07
Publication date: 2019-05-21
Anticipated expiration: 2038-12-07
Also published as: CN109784046B

Abstract

The invention discloses a malware detection method, an apparatus, and an electronic device, and relates to the field of security protection for mobile terminals. It detects malware effectively and overcomes the redundancy, irrelevance, and noise found in the malware features extracted by the prior art. The malware detection method comprises: feature extraction; subset generation; and detection model generation. The malware detection apparatus comprises a feature extraction module, a subset generation module, and a detection model generation module. The electronic device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the malware detection method when executing the program.

Description

Malware detection method, apparatus, and electronic device

Technical field

The present invention relates to security protection for mobile terminals, and in particular to a malware detection method, apparatus, and electronic device.

Background

A mobile intelligent terminal is any mobile terminal that has network access and is equipped with an operating system and application programs. Their wide adoption and powerful functions, led by devices running the Android system, have made mobile intelligent terminals an irreplaceable tool in every field of modern society. At the same time, malware targeting mobile intelligent terminals has become increasingly rampant. Operating undetected, malware damages user systems and steals user data and fees, seriously threatening users' privacy and property. More seriously, classified information concerning the national economy, politics, and military affairs is also threatened, endangering national security. To cope with the growing malware threat to mobile intelligent terminals and to meet the future need to detect unknown malware on them, a detection method for Android malware is needed.

Existing machine-learning-based detection methods take an artificial intelligence approach: a classification algorithm learns the features of known malware and builds an intelligent detection model that evolves continuously and generalizes, so that Android software can be detected automatically and intelligently. The key to such methods is feature selection: the better the selected features distinguish malware from benign software, the more efficient the model obtained from the machine learning classification algorithm and the better its malware detection. However, the malware features extracted by existing methods suffer from redundancy, irrelevance, and noise. Redundant features reduce the computational efficiency of the classification algorithm and the effectiveness of the detection model; irrelevant features mean that more training samples are needed to obtain a suitable detection model; and noisy features can directly lead to an incorrect detection model. These problems greatly increase the time and space cost of machine learning, to the point that the classification algorithm becomes unusable because analyzing the features costs too much.

Summary of the invention

In view of this, an object of the invention is to provide a detection method, apparatus, and electronic device that meet the requirements of mobile terminals for detecting unknown malware while solving the redundancy, irrelevance, and noise problems faced in feature selection.

To this end, the invention provides a malware detection method comprising:

extracting feature information from the software in a sample set and abstracting the feature information into numerical form to obtain a sample-set feature set and a sample-set feature matrix;

filtering the invalid features out of the sample-set feature set with a feature selection algorithm to obtain an optimal feature subset;

training on the feature matrix corresponding to the optimal feature subset with a machine learning classification algorithm to generate a detection model.

Optionally, extracting the feature information from the software in the sample set and abstracting it into numerical form to obtain the sample-set feature set and the sample-set feature matrix comprises:

processing the software installation packages in the sample set to obtain the global configuration file, which contains the permission information, and the decompiled files, which contain the API information;

extracting the corresponding permission and API feature information from the global configuration file and the decompiled files;

vectorizing the extracted permission and API feature information into numerical form to obtain the sample-set feature set and the sample-set feature matrix.

Optionally, filtering the invalid features out of the sample-set feature set with the feature selection algorithm to obtain the optimal feature subset comprises:

Step one: initialize the constants associated with the sample-set feature set and the sample-set feature matrix, and the parameters used during subset generation;

Step two: compute the feature frequency of every feature in the sample-set feature set according to the feature frequency formula, and eliminate irrelevant features by comparing the computed values to obtain the irrelevance-removed feature subset;

The characteristic frequency calculation formula:

where TF(f^j) denotes the feature frequency of feature f^j; N_benign denotes the number of benign samples in the benign software set, and the accompanying count is the number of those samples in which feature f^j appears; N_malware denotes the number of malicious samples in the malicious sample set, and the accompanying count is the number of those samples in which feature f^j appears;

Step three: compute the information gain of every feature in the irrelevance-removed feature subset according to the information gain formula, and screen the features by comparing the computed values to obtain the denoised feature subset;

The information gain calculation formula:

IG(f^j)=H (Y)-H (Y | f^j)

where IG(f^j) denotes the information gain that feature f^j contributes to the classification system, H(Y) denotes the entropy of the classification system, and H(Y | f^j) denotes the conditional entropy of the classification system given f^j;

Step four: according to the χ² statistic formula, compute the CHI value (χ² statistic) between each feature in the denoised feature subset and the corresponding feature matrix, as well as the CHI values between features, and screen the features by comparing the computed values to obtain the redundancy-removed feature subset;

The χ² statistic formula:

CHI(fⁱ, f^j)=ξ₁₁+ξ₁₂+ξ₂₁+ξ₂₂

where CHI(fⁱ, f^j) denotes the χ² statistic of features fⁱ and f^j; ξ₁₁ is the deviation between the theoretical and actual number of samples in which fⁱ and f^j both appear; ξ₁₂ is the deviation for samples in which fⁱ appears but f^j does not; ξ₂₁ is the deviation for samples in which fⁱ does not appear but f^j does; ξ₂₂ is the deviation for samples in which neither fⁱ nor f^j appears;

Step five: analyze and evaluate the redundancy-removed feature subset, and optimize the subset according to the result to obtain the optimal feature subset.

Optionally, step one is specifically:

Denote the sample-set feature set as F_v, the sample-set feature matrix as X_train, and the number of selected features as M_v. Set the initial information gain threshold to a particular value θ_ig, the information gain step size to λ, the initial value of the information gain loop counter to n=0, and the detection rate threshold to 0.95. Train on the sample-set feature matrix X_train with a machine learning classification algorithm and record the maximum detection rate as TP_max.

Optionally, step two comprises:

Step 1: compute the feature frequency of every feature in the sample-set feature set;

Step 2: filter out the features whose feature frequency is 0; the remaining features form the intermediate feature subset;

Step 3: train on the feature matrix corresponding to the intermediate feature subset with a machine learning classification algorithm to obtain the detection rate TP_tf;

Step 4: filter out the feature with the smallest feature frequency in the intermediate feature subset; the remaining features form a reduced feature subset. Train on the feature matrix corresponding to the reduced feature subset with the machine learning classification algorithm to obtain the detection rate TP_tf′;

Step 5: compare TP_tf with TP_tf′. If TP_tf=TP_tf′, take the reduced feature subset as the new intermediate feature subset and return to Step 3; if TP_tf≠TP_tf′, output the intermediate feature subset;

Step 6: denote the output intermediate feature subset as F_v1 and the number of selected features as M_v1; F_v1 is the irrelevance-removed feature subset.

Optionally, step three comprises:

Step 1: compute the information gain of every feature in the irrelevance-removed feature subset;

Step 2: increment the loop counter by 1, i.e. n=n+1;

Step 3: select the features satisfying IG > (θ_ig-(n-1)λ) to form one feature subset and record its feature count; select the features satisfying IG > (θ_ig-nλ) to form a second feature subset and record its feature count;

Step 4: compare the two feature counts and, depending on the result, either return to Step 2 or output the feature subset;

Step 5: denote the output feature subset as F_v2 and the number of selected features as M_v2; F_v2 is the denoised feature subset.

Optionally, step four comprises:

Step 1: compute the CHI value between each feature in the denoised feature subset F_v2 and the corresponding feature matrix, and denote the largest of these CHI values as θ_chi;

Step 2: compute the CHI values between features, pick out the feature pairs whose CHI value exceeds θ_chi, select from each pair the feature with the smaller IG value, and arrange the selected features in descending order of CHI value to form the redundant feature set; record the number of redundant features selected;

Step 3: set the loop counter m=0;

Step 4: increment the loop counter by 1, i.e. m=m+1;

Step 5: following the order of the redundant features in the redundant feature set, remove one redundant feature from F_v2 to obtain a feature subset and record its feature count; train on the feature matrix corresponding to this feature subset with a machine learning classification algorithm to obtain the corresponding detection rate;

Step 6: compare m with the number of redundant features; if redundant features remain to be processed, return to Step 4; otherwise go to the next step;

Step 7: compare all the detection rates obtained, record the maximum, and denote the feature subset corresponding to the maximum detection rate as F_v3, with M_v3 selected features; F_v3 is the redundancy-removed feature subset.

Optionally, step five is specifically:

Compare the maximum detection rate obtained in step four with the maximum detection rate TP_max and assign the larger of the two to TP_max. Then compare TP_max with the initially set detection rate threshold of 0.95: if TP_max<0.95, return to step three; if TP_max>=0.95, the feature subset corresponding to TP_max is the optimal feature subset, denoted F_v.

Optionally, generating the detection model is specifically:

Set a detection rate threshold, train on the feature matrix corresponding to the optimal feature subset with the Bayesian algorithm, the support vector machine algorithm, the decision tree algorithm, and the nearest neighbor classification algorithm respectively, and output the optimal detection model selected according to the detection rate threshold.

Selecting the optimal detection model for output according to the detection rate threshold is specifically:

If the detection rate of the trained detection model is not lower than the threshold, output that detection model;

If the detection rate of the trained detection model is lower than the threshold, change the feature combination and retrain to obtain a new detection model, repeating until the threshold is met, then output the detection model that meets the threshold;

If every possible feature combination has been traversed and no detection model meets the threshold, output the detection model with the highest detection rate found during the traversal.
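The three output rules above can be sketched in Python as follows. This is only an illustrative sketch: the function name `pick_detection_model`, the `train_and_score` callback (standing in for training a classifier on one feature combination and returning its detection rate), and the way candidate feature combinations are supplied are all assumptions, not part of the patent text.

```python
from typing import Callable, Iterable, Tuple


def pick_detection_model(
    combos: Iterable[Tuple[str, ...]],
    train_and_score: Callable[[Tuple[str, ...]], float],
    threshold: float,
) -> Tuple[Tuple[str, ...], float]:
    """Return (feature combination, detection rate) for the model to output.

    train_and_score is a hypothetical callback: it trains a model on the
    given feature combination and returns the model's detection rate.
    """
    best = None
    for combo in combos:
        rate = train_and_score(combo)
        if rate >= threshold:
            # Rule 1 and rule 2: stop at the first model meeting the threshold.
            return combo, rate
        if best is None or rate > best[1]:
            best = (combo, rate)
    if best is None:
        raise ValueError("no feature combinations supplied")
    # Rule 3: every combination was tried; fall back to the best one seen.
    return best


# Hypothetical detection rates, used only to exercise the selection logic.
fake_rates = {("perm",): 0.90, ("perm", "api"): 0.96, ("api",): 0.99}
chosen = pick_detection_model(fake_rates, fake_rates.__getitem__, 0.95)
```

With the made-up rates shown, the loop stops at the first combination that reaches the threshold rather than continuing to search for a higher rate, which matches the "until the threshold is met" wording.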

The invention also provides a malware detection apparatus, comprising:

a feature extraction module, configured to extract feature information from the software in a sample set and abstract the feature information into numerical form to obtain a sample-set feature set and a sample-set feature matrix;

a subset generation module, configured to filter the invalid features out of the sample-set feature set with a feature selection algorithm to obtain an optimal feature subset;

a detection model generation module, configured to train on the feature matrix corresponding to the optimal feature subset with a machine learning classification algorithm to generate a detection model.

The invention also provides an electronic device for malware detection, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the malware detection method provided by the invention when executing the program.

As can be seen from the above, the malware detection method, apparatus, and electronic device provided by the invention extract permission and sensitive API features from malware and benign software, use a feature selection algorithm to choose the best of the extracted features, and train a machine learning classification algorithm on the selected combination of permission and sensitive API features, so that malware can be detected effectively. The feature selection algorithm is designed around feature frequency, information gain, and χ² statistics: feature frequency filters out the features that are irrelevant to classification, information gain selects the features that most influence classification, and χ² statistics removes the highly redundant features. The malware detection method provided by the invention therefore overcomes the redundancy, irrelevance, and noise present in the malware features extracted by prior-art methods. Because the feature selection algorithm combines feature frequency, information gain, and χ² statistics in a preferred order, it selects features better than a simple combination of the three methods, any one of them alone, or a simple combination of any two. A detection model trained by a machine learning classification algorithm on the resulting feature subset is more efficient and detects malware better.

Brief description of the drawings

To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed to describe them are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the malware detection method in an embodiment of the invention;

Fig. 2 is a flow block diagram of the malware detection method in an embodiment of the invention;

Fig. 3 is a flow diagram of the feature extraction method in the malware detection method in an embodiment of the invention;

Fig. 4 is a schematic diagram of the subset generation method in the malware detection method in an embodiment of the invention;

Fig. 5 is a flow block diagram of the feature frequency irrelevance removal method in the malware detection method in an embodiment of the invention;

Fig. 6 is a flow block diagram of the information gain denoising method in the malware detection method in an embodiment of the invention;

Fig. 7 is a flow block diagram of the χ² statistic redundancy removal method in the malware detection method in an embodiment of the invention;

Fig. 8 is a schematic diagram of the detection model generation method in the malware detection method in an embodiment of the invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to specific embodiments and the drawings.

One aspect of the invention provides a malware detection method.

As shown in Fig. 1 and Fig. 2, in some embodiments of the malware detection method provided by the invention, the method specifically comprises:

S101: feature extraction. Decompile the software samples by reverse engineering, extract the feature information, abstract it into an easily analyzed numerical sample-set feature set and sample-set feature matrix, and store them in a database;

S102: subset generation. Filter the invalid features out of the sample-set feature set with a feature selection algorithm to obtain the optimal feature subset;

S103: detection model generation. Train on the feature matrix corresponding to the optimal feature subset with a machine learning classification algorithm to generate a dual-feature detection model.

As shown in Fig. 3, in other embodiments of the malware detection method provided by the invention, the feature extraction method specifically comprises:

S301: process the software APK package: decode the AndroidManifest.xml file, which contains the permission information, into a plain-text global configuration file, and decompile the classes.dex file, which contains the API information, into .smali files;

S302: extract the corresponding permission and API feature information from the global configuration file and the .smali files;

S303: vectorize the extracted semantic feature information into numerical form to obtain the sample-set feature set and the sample-set feature matrix. In the sample-set feature matrix, 1 indicates that a feature appears in a sample and 0 indicates that it does not, so the features of the sample set are described by a binary sample-set feature matrix in which each row is a sample vector and each column is a feature vector.
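Steps S302 and S303 can be illustrated with a short Python sketch. It assumes the manifest has already been decoded to plain-text XML and the dex file already decompiled to smali text (for example by an external tool, which is not shown). The function names, the regular expression used to pick API calls out of smali `invoke-*` lines, and the sample data are illustrative assumptions and not the patent's own implementation.

```python
import re
import xml.etree.ElementTree as ET

ANDROID_NS = "http://schemas.android.com/apk/res/android"
# Assumed pattern: capture "Lpkg/Class;->method" from smali invoke-* lines.
INVOKE_RE = re.compile(r"invoke-[\w/-]+ \{[^}]*\}, (L[^;]+;->[^(\s]+)")


def permissions_from_manifest(manifest_xml):
    """Collect android:name values of <uses-permission> elements."""
    root = ET.fromstring(manifest_xml)
    key = "{%s}name" % ANDROID_NS
    return {el.attrib[key] for el in root.iter("uses-permission") if key in el.attrib}


def api_calls_from_smali(smali_text):
    """Collect the API methods invoked in one decompiled .smali file."""
    return set(INVOKE_RE.findall(smali_text))


def build_feature_matrix(samples):
    """Return (feature_set, matrix): rows are samples, columns are features."""
    feature_set = sorted(set().union(*samples))
    index = {name: j for j, name in enumerate(feature_set)}
    matrix = [[0] * len(feature_set) for _ in samples]
    for i, features in enumerate(samples):
        for name in features:
            matrix[i][index[name]] = 1  # 1 = feature appears in this sample
    return feature_set, matrix


# Made-up inputs for illustration only.
manifest = (
    '<manifest xmlns:android="http://schemas.android.com/apk/res/android">'
    '<uses-permission android:name="android.permission.SEND_SMS"/>'
    '<uses-permission android:name="android.permission.INTERNET"/>'
    "</manifest>"
)
smali = "    invoke-virtual {v0, v1}, Landroid/telephony/SmsManager;->sendTextMessage(Ljava/lang/String;)V\n"

sample_a = permissions_from_manifest(manifest) | api_calls_from_smali(smali)
sample_b = {"android.permission.INTERNET"}
feature_set, matrix = build_feature_matrix([sample_a, sample_b])
```

Sorting the feature names gives every sample the same column order, so each row of `matrix` is the binary sample vector described in S303.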

As shown in Fig. 4, in other embodiments of the malware detection method provided by the invention, the subset generation method specifically comprises:

S401 Step one, constant initialization: initialize the constants associated with the sample-set feature set and the sample-set feature matrix, and the parameters used during subset generation.

S402 Step two, feature frequency irrelevance removal: compute the feature frequency of every feature in the sample-set feature set according to the feature frequency formula, and filter out irrelevant features by comparing the computed values to obtain the irrelevance-removed feature subset;

The characteristic frequency calculation formula:

where TF(f^j) denotes the feature frequency of feature f^j; N_benign denotes the number of benign samples in the benign software set, and the accompanying count is the number of those samples in which feature f^j appears; N_malware denotes the number of malicious samples in the malicious sample set, and the accompanying count is the number of those samples in which feature f^j appears.

S403 Step three, information gain denoising: compute the information gain of every feature in the irrelevance-removed feature subset according to the information gain formula, and screen the features by comparing the computed values to obtain the denoised feature subset;

The information gain calculation formula:

IG(f^j)=H (Y)-H (Y | f^j)

where IG(f^j) denotes the information gain that feature f^j contributes to the classification system, H(Y) denotes the entropy of the classification system, and H(Y | f^j) denotes the conditional entropy of the classification system given f^j.

The information gain formula is explained as follows:

Let the probability that a benign software sample occurs be P(c₀) and the probability that a malware sample occurs be P(c₁). The entropy of the classification system is then defined as:

$$H(Y) = -\sum_{i=0}^{1} P(c_i)\,\log P(c_i)$$

Given the conditional probability P(c_i | f^j=1) of each class when feature f^j appears, the conditional entropy of the classification system is defined as:

$$H(Y \mid f^j = 1) = -\sum_{i=0}^{1} P(c_i \mid f^j = 1)\,\log P(c_i \mid f^j = 1)$$

Likewise, when feature f^j does not appear, the conditional entropy of the classification system is defined as:

$$H(Y \mid f^j = 0) = -\sum_{i=0}^{1} P(c_i \mid f^j = 0)\,\log P(c_i \mid f^j = 0)$$

where P(c_i) is the proportion of class c_i samples among all training samples; P(f^j=1) is the proportion of samples in which feature f^j appears; and P(f^j=0) is the proportion of samples in which feature f^j does not appear.

The information gain IG(f^j) of feature f^j for the classification system is therefore computed as:

$$IG(f^j) = H(Y) - P(f^j = 1)\,H(Y \mid f^j = 1) - P(f^j = 0)\,H(Y \mid f^j = 0)$$
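As a worked illustration of the information gain of a single binary feature column against the class labels, the following Python sketch may help. The helper names and the choice of a base-2 logarithm are assumptions; the text does not state the logarithm base.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2, an assumed choice) of a list of class counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c:  # skip empty classes, treating 0 * log(0) as 0
            p = c / total
            h -= p * math.log2(p)
    return h


def information_gain(column, labels):
    """IG of one binary feature column; labels use 0 = benign, 1 = malware."""
    n = len(labels)
    h_y = entropy([labels.count(0), labels.count(1)])
    conditional = 0.0
    for value in (0, 1):  # feature absent, feature present
        subset = [y for x, y in zip(column, labels) if x == value]
        if subset:
            weight = len(subset) / n  # P(f = value)
            conditional += weight * entropy([subset.count(0), subset.count(1)])
    return h_y - conditional


# Toy data: the first feature separates the classes perfectly, the second not at all.
labels = [1, 1, 0, 0]
ig_perfect = information_gain([1, 1, 0, 0], labels)
ig_useless = information_gain([1, 0, 1, 0], labels)
```

A feature that perfectly separates the two balanced classes recovers the full entropy of the class variable, while a feature that is independent of the class contributes nothing.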

S404 Step four, χ² statistic redundancy removal: according to the χ² statistic formula, compute the CHI value (χ² statistic) between each feature in the denoised feature subset and the corresponding feature matrix, as well as the CHI values between features, and screen the features by comparing the computed values to obtain the redundancy-removed feature subset;

The χ² statistic formula for two features fⁱ and f^j:

CHI(fⁱ, f^j)=ξ₁₁+ξ₁₂+ξ₂₁+ξ₂₂

where CHI(fⁱ, f^j) denotes the χ² statistic of features fⁱ and f^j; ξ₁₁ is the deviation between the theoretical and actual number of samples in which fⁱ and f^j both appear; ξ₁₂ is the deviation for samples in which fⁱ appears but f^j does not; ξ₂₁ is the deviation for samples in which fⁱ does not appear but f^j does; ξ₂₂ is the deviation for samples in which neither fⁱ nor f^j appears.

The χ² statistic formula is explained as follows:

The χ² statistic uses the deviation between actual and theoretical values to measure the degree of correlation between a feature and a class. Consider two features fⁱ and f^j. Let N₁₁ be the number of samples in which both features appear, N₂₂ the number in which neither appears, N₁₂ the number in which fⁱ appears but f^j does not, and N₂₁ the number in which fⁱ does not appear but f^j does. Their relationship is shown in Table 1:

Table 1 Feature distribution

                        f^j appears    f^j does not appear    Total
    fⁱ appears          N₁₁            N₁₂                    N₁₁+N₁₂
    fⁱ does not appear  N₂₁            N₂₂                    N₂₁+N₂₂
    Total               N₁₁+N₂₁        N₁₂+N₂₂                N

where N is the total number of samples, i.e. the sum of the four cases:

$$N = N_{11} + N_{12} + N_{21} + N_{22}$$

It follows that the frequency with which feature fⁱ appears is:

$$P(f^i) = \frac{N_{11} + N_{12}}{N}$$

Feature f^j appears in N₁₁+N₂₁ samples. In theory, the number of samples containing f^j in which fⁱ also appears is:

$$E_{11} = (N_{11} + N_{21}) \cdot \frac{N_{11} + N_{12}}{N}$$

The deviation ξ₁₁ between the theoretical and actual values for fⁱ and f^j appearing together is therefore:

$$\xi_{11} = \frac{(N_{11} - E_{11})^2}{E_{11}}$$

Similarly, one obtains the theoretical sample count E₁₂ for fⁱ appearing without f^j, E₂₁ for f^j appearing without fⁱ, and E₂₂ for neither appearing, together with the corresponding deviations ξ₁₂, ξ₂₁, ξ₂₂:

$$E_{12} = (N_{12} + N_{22}) \cdot \frac{N_{11} + N_{12}}{N}, \qquad \xi_{12} = \frac{(N_{12} - E_{12})^2}{E_{12}}$$

$$E_{21} = (N_{11} + N_{21}) \cdot \frac{N_{21} + N_{22}}{N}, \qquad \xi_{21} = \frac{(N_{21} - E_{21})^2}{E_{21}}$$

$$E_{22} = (N_{12} + N_{22}) \cdot \frac{N_{21} + N_{22}}{N}, \qquad \xi_{22} = \frac{(N_{22} - E_{22})^2}{E_{22}}$$

The χ² statistic of the two features fⁱ and f^j is therefore the sum of the deviations ξ₁₁, ξ₁₂, ξ₂₁, ξ₂₂, i.e.

$$CHI(f^i, f^j) = \xi_{11} + \xi_{12} + \xi_{21} + \xi_{22}$$
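The χ² statistic between two binary feature columns can be sketched in Python as below. The sketch assumes the standard Pearson form, in which each deviation is (observed - expected)² / expected over the 2×2 contingency table; the function name and the handling of empty rows or columns are illustrative choices.

```python
def chi_square(col_i, col_j):
    """Chi-square statistic of two binary feature columns of equal length."""
    n11 = n12 = n21 = n22 = 0
    for a, b in zip(col_i, col_j):
        if a and b:
            n11 += 1  # both features appear
        elif a:
            n12 += 1  # only the first feature appears
        elif b:
            n21 += 1  # only the second feature appears
        else:
            n22 += 1  # neither feature appears
    n = n11 + n12 + n21 + n22
    if n == 0:
        return 0.0
    observed = ((n11, n12), (n21, n22))
    row_totals = (n11 + n12, n21 + n22)
    col_totals = (n11 + n21, n12 + n22)
    chi = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / n  # theoretical count
            if expected > 0:  # assumed convention: skip cells with no expectation
                chi += (observed[r][c] - expected) ** 2 / expected
    return chi


chi_identical = chi_square([1, 1, 0, 0], [1, 1, 0, 0])
chi_independent = chi_square([1, 1, 0, 0], [1, 0, 1, 0])
```

Two identical columns give the largest possible value for four samples, which is the situation the redundancy removal step is meant to catch, while two independent columns give zero.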

S405 Step five, optimal feature subset generation: analyze and evaluate the redundancy-removed feature subset, and carry out further operations according to the result to obtain the final optimal feature subset.

Step one is specifically:

Denote the sample-set feature matrix as X_train, the selected feature set as F_v, and the number of selected features as M_v. Set the initial information gain threshold to a particular value θ_ig, the information gain step size to λ, the information gain loop counter to n=0, and the detection rate threshold to 0.95. Train on the original feature matrix X_train with a machine learning classification algorithm and record the maximum detection rate as TP_max.

As shown in Fig. 5, step two specifically comprises:

S501: compute the feature frequency of every feature in the sample-set feature set;

S502: filter out the features whose feature frequency is 0; the remaining features form the intermediate feature subset;

S503: train on the feature matrix corresponding to the intermediate feature subset with a machine learning classification algorithm to obtain the detection rate TP_tf;

S504: filter out the feature with the smallest feature frequency in the intermediate feature subset; the remaining features form a reduced feature subset. Train on the feature matrix corresponding to the reduced feature subset with the machine learning classification algorithm to obtain the detection rate TP_tf′;

S505: compare TP_tf with TP_tf′. If TP_tf=TP_tf′, take the reduced feature subset as the new intermediate feature subset and return to S503; if TP_tf≠TP_tf′, output the intermediate feature subset;

S506: denote the output feature subset as F_v1 and the number of selected features as M_v1; F_v1 is the irrelevance-removed feature subset.
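The loop S502 to S506 can be sketched in Python as follows. The sketch takes the feature frequencies as given, since the frequency formula itself is not reproduced here, and uses a hypothetical `detect_rate` callback in place of training a classifier; the function name and the stop when a single feature remains are assumptions added for safety.

```python
def drop_uncorrelated(tf, detect_rate):
    """Greedy removal of low-frequency features while the detection rate holds.

    tf:          dict mapping feature name -> feature frequency (precomputed)
    detect_rate: hypothetical callback that trains a classifier on the given
                 feature list and returns its detection rate
    """
    current = [f for f, value in tf.items() if value != 0]  # S502
    rate = detect_rate(current)                              # S503
    while len(current) > 1:  # assumed guard: never remove the last feature
        weakest = min(current, key=tf.__getitem__)           # S504
        reduced = [f for f in current if f != weakest]
        if detect_rate(reduced) != rate:                     # S505: rate changed
            break
        current = reduced  # rate unchanged: accept the reduced subset
    return current                                           # S506


# Made-up frequencies and a toy scorer, only to exercise the loop.
toy_tf = {"a": 0.0, "b": 0.1, "c": 0.5, "d": 0.9}
toy_rate = lambda feats: 0.9 if "c" in feats and "d" in feats else 0.5
kept = drop_uncorrelated(toy_tf, toy_rate)
```

The comparison uses exact equality of detection rates, as the text states; a practical implementation would likely compare within a small tolerance, but that is a design choice outside what the text specifies.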

As shown in Fig. 6, step three specifically comprises:

S601: compute the information gain of every feature in the irrelevance-removed feature subset;

S602: increment the loop counter by 1, i.e. n=n+1;

S603: select the features satisfying IG > (θ_ig-(n-1)λ) to form one feature subset and record its feature count; select the features satisfying IG > (θ_ig-nλ) to form a second feature subset and record its feature count;

S604: compare the two feature counts and, depending on the result, either return to S602 or output the feature subset;

S605: denote the output feature subset as F_v2 and the number of selected features as M_v2; F_v2 is the denoised feature subset.

As shown in Fig. 7, step four specifically comprises:

S701: compute the CHI value (χ² statistic) between each feature in the feature subset F_v2 and the corresponding feature matrix, and denote the largest of these CHI values as θ_chi;

S702: compute the CHI values between features, pick out the feature pairs whose CHI value exceeds θ_chi, select from each pair the feature with the smaller IG value, and arrange the selected features in descending order of CHI value to form the redundant feature set; record the number of redundant features selected;

S703: set the loop counter m=0;

S704: increment the loop counter by 1, i.e. m=m+1;

S705: following the order of the redundant features in the redundant feature set, remove one redundant feature from F_v2 to obtain a feature subset and record its feature count; train on the feature matrix corresponding to this feature subset with a machine learning classification algorithm to obtain the corresponding detection rate;

S706: compare m with the number of redundant features; if redundant features remain to be processed, return to S704; otherwise go to the next step;

S707: compare all the detection rates obtained, record the maximum, and denote the feature subset corresponding to the maximum detection rate as F_v3, with M_v3 selected features; F_v3 is the redundancy-removed feature subset.

Step five, generating the optimal feature subset, is specifically:

Compare the maximum detection rate obtained in step four with the maximum detection rate TP_max and assign the larger of the two to TP_max. Then compare TP_max with the initially set detection rate threshold of 0.95: if TP_max<0.95, return to step three; if TP_max>=0.95, the feature subset corresponding to TP_max is the optimal feature subset, denoted F_v.

As shown in Fig. 8, in other embodiments of the malware detection method provided by the invention, generating the detection model is specifically:

First set a detection rate threshold, then train on the permission and sensitive API features with the Bayesian algorithm (NB), the support vector machine algorithm (SVM), the decision tree algorithm (DT), and the nearest neighbor classification algorithm (KNN) respectively, and output the optimal detection model selected according to the detection rate threshold.

As shown in Fig. 8, in other embodiments of the malware detection method provided by the invention, the method of selecting the optimal detection model for output according to the detection threshold is specifically:

Table 2 shows the results of testing the detection performance in one embodiment of the malware detection method provided by the invention.

Table 2 Detection performance results

The specific implementation is as follows:

The sample set consists of 5000 benign applications from the Anzhi market that passed its checks and 5000 malware samples from VirusShare, and the test uses 10-fold cross-validation. (10-fold cross-validation divides the sample data into 10 mutually exclusive subsets of similar size; each round uses the union of 9 subsets as the training set and the remaining subset as the test set, giving 10 rounds of training and testing, and the final result is the mean of the 10 test results.)
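The 10-fold cross-validation procedure described above can be sketched in pure Python. The function names, the shuffling seed, and the `evaluate` callback (standing in for training on the training indices and scoring on the test indices) are illustrative assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k mutually exclusive folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of similar size
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Mean of the k test scores returned by the hypothetical evaluate callback."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(100))
# Toy scorer: the score is just the share of samples held out in each round.
mean_score = cross_validate(100, lambda train, test: len(test) / 100)
```

Each sample index lands in exactly one test fold, so every sample is used for testing once and for training nine times.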

Feature selecting algorithm to unused feature selecting algorithm and based on characteristic frequency, information gain and statistics respectively Detection performance compares and analyzes.

Wherein, the meaning of performance indicator is as follows

(1) TPR (detection rate) is the ratio of positive examples correctly classified by the classifier to all actual positive examples. The larger the TPR, the better the classifier performs on positive examples. It is calculated as follows:

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$

(2) FPR (false alarm rate) is the ratio of negative examples that the classifier incorrectly classifies as positive to all actual negative examples. The larger the FPR, the worse the classifier performs on negative examples. It is calculated as follows:

$$\mathrm{FPR} = \frac{FP}{FP + TN}$$

(3) Acc (accuracy) is the ratio of all samples correctly classified by the classifier to the total number of samples. It expresses how accurately the classifier classifies overall; the larger the Acc, the better the overall classification ability of the classifier. It is calculated as follows:

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$

In these formulas, TP (true positives) is the number of samples that are actually positive and are detected as positive, i.e. correctly detected positive examples; FP (false positives) is the number of samples that are actually negative but are detected as positive, i.e. incorrectly detected negative examples; FN (false negatives) is the number of samples that are actually positive but are detected as negative, i.e. misclassified positive examples; TN (true negatives) is the number of samples that are actually negative and are detected as negative, i.e. correctly detected negative examples.
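As a minimal sketch, the three indicators can be computed from the four counts using the standard confusion-matrix definitions TPR = TP / (TP + FN), FPR = FP / (FP + TN), and Acc = (TP + TN) / (TP + TN + FP + FN). The function name and the dictionary return type are illustrative assumptions.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute detection rate, false alarm rate, and accuracy from confusion counts."""
    return {
        "TPR": tp / (tp + fn),                   # detected malware / all malware
        "FPR": fp / (fp + tn),                   # benign flagged as malware / all benign
        "Acc": (tp + tn) / (tp + fp + fn + tn),  # correct decisions / all samples
    }
```

The sketch assumes both classes are present in the test set, so neither denominator is zero.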

The test results show that, once the malware detection method proposed by the present invention has used the feature selection algorithm to optimize the software features, the detection models obtained by training those features with the four different machine learning classification algorithms all achieve a higher detection rate and accuracy than the detection models obtained by the traditional malware detection method without feature selection, and all achieve a lower false alarm rate.

This shows that the malware detection method proposed by the present invention has higher detection efficiency and a better detection effect.

Another aspect of the present invention provides a malware detection device.

In some embodiments of the malware detection device provided by the present invention, the device includes:

Another aspect of the present invention provides an electronic device for malware detection.

In some embodiments of the electronic device for malware detection provided by the present invention, the electronic device includes:

A memory, a processor, and a computer program stored in the memory and executable on the processor.

The processor implements the malware detection method provided by the present invention when executing the program.

The device and the electronic device of the above embodiments are used to implement the corresponding methods of the preceding embodiments and have the beneficial effects of the corresponding method embodiments, which are not repeated here.

Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is exemplary only and is not intended to imply that the scope of the present disclosure (including the claims) is limited to these examples. Within the spirit of the present invention, technical features of the above embodiments or of different embodiments may also be combined, steps may be carried out in any order, and many other variations of the different aspects of the present invention as described above exist, which are not provided in detail for the sake of brevity.

In addition, to simplify the description and discussion, and so as not to obscure the present invention, well-known power and ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided drawings. Furthermore, devices may be shown in block diagram form to avoid obscuring the present invention, which also takes into account the fact that the details of implementing such block diagram devices are highly dependent on the platform on which the present invention is to be implemented (that is, these details should be well within the understanding of those skilled in the art). Where specific details (for example, circuits) are set forth to describe exemplary embodiments of the present invention, it will be apparent to those skilled in the art that the present invention can be practiced without these specific details or with variations of them. These descriptions should therefore be regarded as illustrative rather than restrictive.

Although the present invention has been described in conjunction with specific embodiments, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, the embodiments discussed may be used with other memory architectures (for example, dynamic RAM (DRAM)).

The embodiments of the present invention are intended to cover all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims

1. A malware detection method, characterized by comprising:

extracting feature information of the sample set software, abstracting the feature information into numerical form, and obtaining a sample set feature set and a sample set feature matrix;

filtering invalid features in the sample set feature set using a feature selection algorithm to obtain an optimal feature subset;

training the feature matrix corresponding to the optimal feature subset using a machine learning classification algorithm to generate a detection model.

2. The method according to claim 1, characterized in that extracting the feature information of the sample set software, abstracting the feature information into numerical form, and obtaining the sample set feature set and the sample set feature matrix comprises:

processing the installation packages of the sample set software to obtain a global configuration file containing permission information and decompiled files containing API information;

vectorizing the extracted permission and API feature information into numerical form to obtain the sample set feature set and the sample set feature matrix.
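One way to picture the vectorization step is a binary presence matrix, sketched below. The 0/1 encoding, the function name, and the example permission and API names are assumptions for illustration; the claim only requires that the extracted features be abstracted into numerical form.

```python
from typing import Dict, List, Set, Tuple


def build_feature_matrix(
    samples: Dict[str, Set[str]]
) -> Tuple[List[str], List[List[int]]]:
    """Map each sample's extracted permissions/APIs to a 0/1 row vector."""
    # The feature set is the sorted union of every feature seen in any sample.
    feature_set = sorted(set().union(*samples.values()))
    # One row per sample, one column per feature: 1 if present, else 0.
    matrix = [[1 if f in feats else 0 for f in feature_set]
              for feats in samples.values()]
    return feature_set, matrix
```

Rows follow the insertion order of `samples`, and columns follow the sorted feature set.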

3. The method according to claim 1, characterized in that filtering the invalid features in the sample set feature set using the feature selection algorithm to obtain the optimal feature subset comprises:

Step 1: initializing the relevant parameters and constants of the sample set feature set and the sample set feature matrix, and the parameters used in the subset generation process;

Step 2: calculating the feature frequency of each feature in the sample set feature set according to the feature frequency formula, and filtering out irrelevant features through calculation and comparison to obtain an irrelevance-filtered feature subset;

The feature frequency formula:

where TF(f^j) denotes the feature frequency of feature f^j; N_benign denotes the number of benign samples in the benign software set and is used together with the number of those samples in which feature f^j appears; N_malware denotes the number of malicious samples in the malicious sample set and is used together with the number of those samples in which feature f^j appears;

Step 3: calculating the information gain of each feature in the irrelevance-filtered feature subset according to the information gain formula, and screening through calculation and comparison to obtain a denoised feature subset;

The information gain formula:

$$IG(f^j) = H(Y) - H(Y \mid f^j)$$

where IG(f^j) denotes the information gain of feature f^j for the classification system, H(Y) denotes the entropy of the classification system, and H(Y | f^j) denotes the conditional entropy of the classification system given feature f^j;

Step 4: calculating, according to the χ² statistic formula, the CHI value (χ² statistic) between each feature in the denoised feature subset and the corresponding feature matrix and the CHI values between features, and screening through calculation and comparison to obtain a de-redundant feature subset;

The χ² statistic formula:

$$\mathrm{CHI}(f^i, f^j) = \xi_{11} + \xi_{12} + \xi_{21} + \xi_{22}$$

where CHI(f^i, f^j) denotes the χ² statistic of features f^i and f^j; ξ₁₁ denotes the deviation between the theoretical and actual values for samples in which f^i and f^j both appear; ξ₁₂ denotes the deviation for samples in which f^i appears but f^j does not; ξ₂₁ denotes the deviation for samples in which f^i does not appear but f^j does; ξ₂₂ denotes the deviation for samples in which neither f^i nor f^j appears;

Step 5: analyzing and judging the de-redundant feature subset, and optimizing the subset according to the judgment result to obtain the optimal feature subset.
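The information gain of step 3 and the χ² statistic of step 4 can be sketched for binary (0/1) feature columns as follows. The function names are illustrative, and the form of each deviation term, (observed − expected)² / expected, is the standard Pearson χ² cell contribution assumed here rather than a definition taken from the claim.

```python
import math
from typing import Sequence


def entropy(labels: Sequence[int]) -> float:
    """H(Y): Shannon entropy (base 2) of a label sequence."""
    n = len(labels)
    h = 0.0
    for cls in set(labels):
        p = sum(1 for y in labels if y == cls) / n
        h -= p * math.log2(p)
    return h


def information_gain(feature: Sequence[int], labels: Sequence[int]) -> float:
    """IG(f) = H(Y) - H(Y | f) for one feature column."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def chi_square(fi: Sequence[int], fj: Sequence[int]) -> float:
    """CHI(f_i, f_j): sum of the four cell deviations of the 2x2 contingency table."""
    n = len(fi)
    chi = 0.0
    for a in (1, 0):
        for b in (1, 0):
            observed = sum(1 for x, y in zip(fi, fj) if x == a and y == b)
            row = sum(1 for x in fi if x == a)
            col = sum(1 for y in fj if y == b)
            expected = row * col / n          # theoretical count under independence
            if expected > 0:                  # skip empty rows or columns
                chi += (observed - expected) ** 2 / expected
    return chi
```

Passing the class label column as the second argument of `chi_square` gives a feature-versus-class CHI value, while passing two feature columns gives the CHI value between features.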

4. The method according to claim 3, characterized in that step 1 is specifically:

denoting the sample set feature set as F_v, the sample set feature matrix as X_train, and the number of selected features as M_v; setting the initial information gain threshold to a specific value θ_ig, the information gain step size to λ, the initial value of the information gain loop step count to n = 0, and the detection rate threshold to 0.95; training the sample set feature matrix X_train with the machine learning classification algorithm and recording the maximum detection rate as TP_max.

5. The method according to claim 4, characterized in that step 2 comprises:

Step 3: training the feature matrix corresponding to the intermediate feature subset with the machine learning classification algorithm to obtain the corresponding detection rate TP_tf;

Step 4: filtering out the feature with the smallest feature frequency from the intermediate feature subset, the remaining features forming a feature subset; training the feature matrix corresponding to that feature subset with the machine learning classification algorithm to obtain the corresponding detection rate TP_tf′;

Step 5: comparing TP_tf with TP_tf′; if TP_tf = TP_tf′, taking the feature subset as the new intermediate feature subset and returning to step 3; if TP_tf ≠ TP_tf′, outputting the intermediate feature subset;

Step 6: denoting the intermediate feature subset as feature subset F_v1, with M_v1 selected features; the feature subset F_v1 is the irrelevance-filtered feature subset.
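The loop in steps 3 to 6 can be sketched as a backward elimination driven by feature frequency. The function name, the `detection_rate` callback (standing in for training a classifier and measuring its detection rate), and the choice to start from the full feature list are assumptions for illustration.

```python
from typing import Callable, Dict, List


def drop_low_frequency_features(
    features: List[str],
    tf: Dict[str, float],
    detection_rate: Callable[[List[str]], float],
) -> List[str]:
    """Remove lowest-frequency features while the detection rate stays unchanged."""
    current = list(features)
    while len(current) > 1:
        rate = detection_rate(current)                    # TP_tf
        weakest = min(current, key=lambda f: tf[f])       # smallest feature frequency
        candidate = [f for f in current if f != weakest]
        rate_after = detection_rate(candidate)            # TP_tf'
        if rate_after != rate:
            break                                         # removal changed the rate: stop
        current = candidate                               # removal was harmless: continue
    return current
```

The exact equality test mirrors the claim wording; with real classifiers a small tolerance may be more practical.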

6. The method according to claim 5, characterized in that step 3 comprises:

Step 2: incrementing the loop step count by 1, i.e. n = n + 1;

Step 3: selecting the features that satisfy IG > (θ_ig − (n − 1)λ) to form a first feature subset and recording the number of features selected; selecting the features that satisfy IG > (θ_ig − nλ) to form a second feature subset and recording the number of features selected;

Step 4: comparing the two feature counts and, depending on the result, either returning to step 2 or outputting the feature subset;

Step 5: denoting the output feature subset as feature subset F_v2, with M_v2 selected features; the feature subset F_v2 is the denoised feature subset.

7. The method according to claim 6, characterized in that step 4 comprises:

Step 1: calculating the CHI value between each feature in the denoised feature subset F_v2 and the corresponding feature matrix, and denoting the largest of these CHI values as θ_chi;

Step 2: calculating the CHI values between features, selecting the feature pairs whose CHI value is greater than θ_chi, picking out the feature with the smaller IG value in each pair, and arranging the picked features in descending order of CHI value to form the redundant feature set, the number of redundant features selected being recorded;

Step 3: setting the loop step count m = 0;

Step 4: incrementing the loop step count by 1, i.e. m = m + 1;

Step 5: following the order of the redundant features in the redundant feature set, removing one redundant feature from F_v2 to obtain a feature subset and its feature count; training the feature matrix corresponding to that feature subset with the machine learning classification algorithm to obtain the corresponding detection rate;

Step 7: comparing all of the detection rates and recording the maximum detection rate; the feature subset corresponding to the maximum detection rate is denoted F_v3, with M_v3 selected features, and the feature subset F_v3 is the de-redundant feature subset.
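The redundancy removal described in this claim can be sketched as follows. The function name, the dictionary inputs, the `detection_rate` callback, and the reading that each pass removes one further redundant feature cumulatively until the list is exhausted are all assumptions for illustration, not details confirmed by the claim.

```python
from typing import Callable, Dict, List, Tuple


def remove_redundant_features(
    features: List[str],
    chi_with_class: Dict[str, float],
    chi_between: Dict[Tuple[str, str], float],
    ig: Dict[str, float],
    detection_rate: Callable[[List[str]], float],
) -> List[str]:
    """Drop redundant features one at a time and keep the best-scoring subset."""
    theta_chi = max(chi_with_class[f] for f in features)        # step 1
    # Step 2: in each strongly associated pair, the lower-IG feature is redundant.
    redundant: Dict[str, float] = {}
    for (fi, fj), chi in chi_between.items():
        if chi > theta_chi:
            weaker = fi if ig[fi] < ig[fj] else fj
            redundant[weaker] = max(chi, redundant.get(weaker, 0.0))
    ordered = sorted(redundant, key=redundant.get, reverse=True)  # descending CHI
    # Steps 3-5: remove one more redundant feature per pass and score the subset.
    best_subset, best_rate = list(features), -1.0
    current = list(features)
    for f in ordered:
        current = [x for x in current if x != f]
        rate = detection_rate(current)
        if rate > best_rate:                                     # step 7: keep the maximum
            best_subset, best_rate = current, rate
    return best_subset
```

If no feature pair exceeds θ_chi, the sketch returns the input feature list unchanged.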

8. The method according to claim 7, characterized in that step 5 is specifically:

comparing the detection rate obtained in step 4 with the maximum detection rate TP_max and assigning the larger of the two to TP_max; comparing TP_max with the initially set detection rate threshold of 0.95; if TP_max < 0.95, returning to step 3; if TP_max >= 0.95, the feature subset corresponding to TP_max is the optimal feature subset, denoted F_v.

9. A malware detection device, characterized in that the device comprises:

a feature extraction module, configured to extract feature information of the sample set software, abstract the feature information into numerical form, and obtain a sample set feature set and a sample set feature matrix;

a subset generation module, configured to filter invalid features in the sample set feature set using a feature selection algorithm to obtain an optimal feature subset;

a detection model generation module, configured to train the feature matrix corresponding to the optimal feature subset using a machine learning classification algorithm to generate a detection model.

10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1 to 8 when executing the program.