CN109784046A - Malware detection method, apparatus, and electronic device

Publication number
CN109784046A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811495637.7A
Other languages
Chinese (zh)
Other versions
CN109784046B (en)
Inventor
胡一博
朱诗兵
李长青
帅海峰
吕登龙
徐华正
张记瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Original Assignee
Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Application filed by Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Priority to CN201811495637.7A
Publication of CN109784046A
Application granted
Publication of CN109784046B
Legal status: Active

Abstract

The invention discloses a malware detection method, apparatus, and electronic device in the field of mobile terminal security. It detects malware effectively and overcomes the redundancy, irrelevance, and noise present in the malware features extracted by the prior art. The malware detection method comprises feature extraction, subset generation, and detection model generation. The malware detection apparatus comprises a feature extraction module, a subset generation module, and a detection model generation module. The electronic device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the malware detection method when executing the program.

Description

Malware detection method, apparatus, and electronic device
Technical field
The present invention relates to the security protection of mobile terminals, and in particular to a malware detection method, apparatus, and electronic device.
Background art
A mobile intelligent terminal is any mobile terminal that can access a network and carries an operating system and application programs. Their wide adoption and powerful functions, with Android-based terminals in the lead, have made mobile intelligent terminals an irreplaceable tool in every field of modern society. At the same time, malware targeting mobile intelligent terminals has become increasingly rampant. Running undetected, it damages the user's system and steals user data and phone charges, seriously threatening users' privacy and property. More seriously, classified information concerning the national economy, politics, and the military is also threatened, which endangers national security. To cope with the growing malware threat to mobile intelligent terminals and to meet the future need to detect unknown malware on them, a detection method for Android malware is needed.
Existing machine-learning detection methods take an artificial-intelligence approach: a classification algorithm learns the features of known malware and builds an intelligent detection model that keeps evolving and generalizes, enabling automated, intelligent detection of Android software. The key to this approach is feature selection: the better the selected features distinguish malware from normal software, the more efficient the detection model produced by the machine learning classification algorithm and the better its malware detection. However, the malware features extracted by existing methods suffer from redundancy, irrelevance, and noise. Redundant features reduce the computational efficiency of the classification algorithm and the effectiveness of the detection model; irrelevant features mean that more training samples are needed to obtain a suitable detection model; noisy features can directly lead to an incorrect detection model. These problems greatly increase the time and space cost of machine learning, to the point where the classification algorithm becomes unusable because analyzing the features costs too much.
Summary of the invention
In view of this, the object of the invention is to propose a detection method, apparatus, and electronic device that meet the requirements of mobile terminals for detecting unknown malware while solving the redundancy, irrelevance, and noise problems faced in feature selection.
To this end, the present invention provides a malware detection method. The malware detection method comprises:
extracting the feature information of the software in a sample set and abstracting the feature information into digital form to obtain a sample set feature set and a sample set feature matrix;
filtering the invalid features out of the sample set feature set using a feature selection algorithm to obtain an optimal feature subset;
training on the feature matrix corresponding to the optimal feature subset using a machine learning classification algorithm to generate a detection model.
Optionally, extracting the feature information of the software in the sample set and abstracting the feature information into digital form to obtain the sample set feature set and the sample set feature matrix comprises:
processing the software installation packages in the sample set to obtain the global configuration file containing permission information and the decompiled files containing API information;
extracting the corresponding permission and API feature information from the global configuration file and the decompiled files;
vectorizing the extracted permission and API feature information into digital form to obtain the sample set feature set and the sample set feature matrix.
Optionally, filtering the invalid features out of the sample set feature set using the feature selection algorithm to obtain the optimal feature subset comprises:
Step One: initialize the constants related to the sample set feature set and the sample set feature matrix, together with the parameters used during subset generation;
Step Two: calculate the characteristic frequency of each feature in the sample set feature set according to the characteristic frequency formula, and eliminate the irrelevant features by calculation and comparison to obtain the irrelevance-filtered feature subset;
The characteristic frequency formula:
where $TF(f_j)$ is the characteristic frequency of feature $f_j$; $N_{benign}$ is the number of normal samples in the normal software set, paired with the number of those samples in which feature $f_j$ appears; $N_{malware}$ is the number of malicious samples in the malicious sample set, paired with the number of those samples in which feature $f_j$ appears;
Step Three: calculate the information gain of each feature in the irrelevance-filtered feature subset according to the information gain formula, and screen by calculation and comparison to obtain the denoised feature subset;
The information gain formula:
$IG(f_j) = H(Y) - H(Y \mid f_j)$
where $IG(f_j)$ is the information gain that feature $f_j$ contributes to the classification system, $H(Y)$ is the entropy of the classification system, and $H(Y \mid f_j)$ is the conditional entropy of the classification system;
Step Four: according to the $\chi^2$ statistic formula, calculate the CHI value ($\chi^2$ statistic) between each feature in the denoised feature subset and the corresponding feature matrix, as well as the CHI values between features, and screen by calculation and comparison to obtain the de-redundant feature subset;
The $\chi^2$ statistic formula:
$CHI(f_i, f_j) = \xi_{11} + \xi_{12} + \xi_{21} + \xi_{22}$
where $CHI(f_i, f_j)$ is the $\chi^2$ statistic of features $f_i$ and $f_j$; $\xi_{11}$ is the deviation between the theoretical and actual numbers of samples in which $f_i$ and $f_j$ both appear; $\xi_{12}$ is the deviation for samples in which $f_i$ appears and $f_j$ does not; $\xi_{21}$ is the deviation for samples in which $f_i$ does not appear and $f_j$ does; $\xi_{22}$ is the deviation for samples in which neither $f_i$ nor $f_j$ appears;
Step Five: analyze and evaluate the de-redundant feature subset, and optimize the subset according to the result to obtain the optimal feature subset.
Optionally, Step One is specifically:
Denote the sample set feature set as $F_v$, the sample set feature matrix as $X_{train}$, and the number of selected features as $M_v$. Set the initial information gain threshold to a particular value $\theta_{ig}$, the information gain step size to $\lambda$, the initial value of the information gain loop counter to $n = 0$, and the detection rate threshold to 0.95. Train on the sample set feature matrix $X_{train}$ using a machine learning classification algorithm and record the maximum detection rate as $TP_{max}$.
Optionally, Step Two comprises:
Step 1: calculate the characteristic frequency of each feature in the sample set feature set;
Step 2: filter out the features whose characteristic frequency is 0; the remaining features form the intermediate feature subset;
Step 3: train on the feature matrix corresponding to the intermediate feature subset using the machine learning classification algorithm to obtain the corresponding detection rate $TP_{tf}$;
Step 4: filter out the feature with the smallest characteristic frequency in the intermediate feature subset; the remaining features form a reduced feature subset. Train on the feature matrix corresponding to the reduced feature subset using the machine learning classification algorithm to obtain the corresponding detection rate $TP_{tf}'$;
Step 5: compare $TP_{tf}$ with $TP_{tf}'$. If $TP_{tf} = TP_{tf}'$, take the reduced feature subset as the new intermediate feature subset and return to step 3; if $TP_{tf} \neq TP_{tf}'$, output the intermediate feature subset;
Step 6: denote the output intermediate feature subset as $F_{v1}$ and its number of features as $M_{v1}$; $F_{v1}$ is the irrelevance-filtered feature subset.
Optionally, Step Three comprises:
Step 1: calculate the information gain of each feature in the irrelevance-filtered feature subset;
Step 2: increment the loop counter by 1, i.e. $n = n + 1$;
Step 3: select the features satisfying $IG > \theta_{ig} - (n-1)\lambda$ to form one feature subset and record its number of features; select the features satisfying $IG > \theta_{ig} - n\lambda$ to form a second feature subset and record its number of features;
Step 4: compare the two feature counts and, depending on the result, either return to step 2 or output the feature subset;
Step 5: denote the output feature subset as $F_{v2}$ and its number of features as $M_{v2}$; $F_{v2}$ is the denoised feature subset.
Optionally, Step Four comprises:
Step 1: calculate the CHI value between each feature in the denoised feature subset $F_{v2}$ and the corresponding feature matrix, and denote the largest of these CHI values as $\theta_{chi}$;
Step 2: calculate the CHI values between features and select the feature pairs whose CHI value exceeds $\theta_{chi}$; from each pair pick the feature with the smaller IG value, and arrange the picked features in descending order of CHI value to form the redundant feature set; record the number of redundant features selected;
Step 3: set the loop counter $m = 0$;
Step 4: increment the loop counter by 1, i.e. $m = m + 1$;
Step 5: following the order of the redundant features in the redundant feature set, reject one redundant feature from $F_{v2}$ to obtain a feature subset, and record its number of features. Train on the feature matrix corresponding to this feature subset using the machine learning classification algorithm to obtain the corresponding detection rate;
Step 6: compare $m$ with the number of redundant features; if $m$ is smaller, return to step 4; otherwise proceed to the next step;
Step 7: compare all the detection rates obtained and record the maximum; the feature subset corresponding to the maximum detection rate is denoted $F_{v3}$, with $M_{v3}$ selected features; $F_{v3}$ is the de-redundant feature subset.
Optionally, Step Five is specifically:
Compare the maximum detection rate obtained in Step Four with the maximum detection rate $TP_{max}$ and assign the larger of the two to $TP_{max}$. Then compare $TP_{max}$ with the initially set detection rate threshold of 0.95: if $TP_{max} < 0.95$, return to Step Three; if $TP_{max} \geq 0.95$, the feature subset corresponding to $TP_{max}$ is the optimal feature subset, denoted $F_v$.
Optionally, generating the detection model is specifically:
Set a detection rate threshold, then train on the feature matrix corresponding to the optimal feature subset with the Bayesian algorithm, the support vector machine algorithm, the decision tree algorithm, and the nearest neighbor classification algorithm separately, and select the optimal detection model for output according to the set detection rate threshold.
Selecting the optimal detection model for output according to the set detection rate threshold is specifically:
if the detection rate of the trained detection model is not below the threshold, output that detection model;
if the detection rate of the trained detection model is below the threshold, change the combination of features and retrain to obtain a new detection model until the threshold requirement is met, then output the detection model that meets it;
if all possible feature combinations have been traversed and no detection model meets the threshold requirement, output the detection model with the highest detection rate found during the traversal.
The present invention also provides a malware detection apparatus, the apparatus comprising:
a feature extraction module, for extracting the feature information of the software in the sample set and abstracting the feature information into digital form to obtain the sample set feature set and the sample set feature matrix;
a subset generation module, for filtering the invalid features out of the sample set feature set using the feature selection algorithm to obtain the optimal feature subset;
a detection model generation module, for training on the feature matrix corresponding to the optimal feature subset using a machine learning classification algorithm to generate the detection model.
The present invention also provides an electronic device for malware detection, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the malware detection method provided by the invention when executing the program.
As can be seen from the above, the malware detection method, apparatus, and electronic device provided by the invention extract the permission and sensitive API features of malware and normal software, optimize the extracted features with a feature selection algorithm, and train on the selected combination of permission and sensitive API features with a machine learning classification algorithm, so that malware is detected effectively. The feature selection algorithm is built on characteristic frequency, information gain, and $\chi^2$ statistics: characteristic frequency filters out the features in the feature set that are irrelevant to classification, information gain selects the features that influence classification most, and the $\chi^2$ statistic rejects the highly redundant features in the feature set. The malware detection method provided by the invention therefore overcomes the redundancy, irrelevance, and noise found in the malware features extracted by prior-art methods. The feature selection algorithm combines these three methods in a preferred order, which optimizes the selection better than combining them naively or using only one or two of them. Training a machine learning classification algorithm on the feature subset produced by the feature selection algorithm yields a detection model that is more efficient and detects malware better.
Brief description of the drawings
To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed to describe them are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the malware detection method in an embodiment of the invention;
Fig. 2 is a flow block diagram of the malware detection method in an embodiment of the invention;
Fig. 3 is a flow block diagram of the feature extraction method in the malware detection method in an embodiment of the invention;
Fig. 4 is a schematic diagram of the subset generation method in the malware detection method in an embodiment of the invention;
Fig. 5 is a flow block diagram of the characteristic frequency irrelevance-removal method in the malware detection method in an embodiment of the invention;
Fig. 6 is a flow block diagram of the information gain denoising method in the malware detection method in an embodiment of the invention;
Fig. 7 is a flow block diagram of the $\chi^2$ statistic redundancy-removal method in the malware detection method in an embodiment of the invention;
Fig. 8 is a schematic diagram of the detection model generation method in the malware detection method in an embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in more detail below with reference to specific embodiments and the accompanying drawings.
One aspect of the invention provides a malware detection method.
As shown in Fig. 1 and Fig. 2, in some embodiments of the malware detection method provided by the invention, the malware detection method specifically comprises:
S101: Feature extraction. Decompile the software samples by reverse engineering, extract the feature information, abstract it into an easily analyzed digital form, namely the sample set feature set and the sample set feature matrix, and store both in a database;
S102: Subset generation. Filter the invalid features out of the sample set feature set using the feature selection algorithm to obtain the optimal feature subset;
S103: Detection model generation. Train on the feature matrix corresponding to the optimal feature subset using a machine learning classification algorithm to generate a dual-feature detection model.
As shown in Fig. 3, in other embodiments of the malware detection method provided by the invention, the feature extraction method specifically comprises:
S301: Process the software APK package: decode the AndroidManifest.xml file, which contains the permission information, into a plain-text global configuration file, and decompile the classes.dex file, which contains the API information, into .smali files;
S302: Extract the corresponding permission and API feature information from the global configuration file and the .smali files;
S303: Vectorize the extracted feature semantic information into digital form to obtain the sample set feature set and the sample set feature matrix. In the sample set feature matrix, 1 indicates that a feature appears in a sample and 0 indicates that it does not, so the features of the sample set are described by a binary sample set feature matrix whose rows are sample vectors and whose columns are feature vectors.
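Steps S302 and S303 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes the manifest has already been decoded to plain-text XML, it reads permissions from `uses-permission` elements, and the helper names `extract_permissions` and `build_feature_matrix`, the sample manifest, and the API string are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace that prefixes android:* attributes in a decoded AndroidManifest.xml.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"


def extract_permissions(manifest_xml):
    """Return the set of permission names requested in a decoded manifest."""
    root = ET.fromstring(manifest_xml)
    return {
        elem.attrib[ANDROID_NS + "name"]
        for elem in root.iter("uses-permission")
        if ANDROID_NS + "name" in elem.attrib
    }


def build_feature_matrix(samples):
    """Build the feature set and the binary sample-by-feature matrix.

    samples: list of sets, one set of feature names per software sample.
    Rows of the matrix are samples, columns are features (1 = present).
    """
    features = sorted(set().union(*samples))
    matrix = [[1 if f in s else 0 for f in features] for s in samples]
    return features, matrix


# Hypothetical two-sample set: one manifest plus one API call, and a
# second sample that only requests INTERNET.
MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android">
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-permission android:name="android.permission.INTERNET"/>
</manifest>"""

sample_a = extract_permissions(MANIFEST) | {
    "Landroid/telephony/SmsManager;->sendTextMessage"
}
sample_b = {"android.permission.INTERNET"}
features, matrix = build_feature_matrix([sample_a, sample_b])
# matrix == [[1, 1, 1], [0, 1, 0]]
```

Sorting the feature names only fixes a reproducible column order; any stable ordering would serve.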
As shown in Fig. 4, in other embodiments of the malware detection method provided by the invention, the subset generation method specifically comprises:
S401 Step One: Constant initialization. Initialize the constants related to the sample set feature set and the sample set feature matrix, together with the parameters used during subset generation.
S402 Step Two: Characteristic frequency irrelevance removal. Calculate the characteristic frequency of each feature in the sample set feature set according to the characteristic frequency formula, and filter out the irrelevant features by calculation and comparison to obtain the irrelevance-filtered feature subset;
The characteristic frequency formula:
where $TF(f_j)$ is the characteristic frequency of feature $f_j$; $N_{benign}$ is the number of normal samples in the normal software set, paired with the number of those samples in which feature $f_j$ appears; $N_{malware}$ is the number of malicious samples in the malicious sample set, paired with the number of those samples in which feature $f_j$ appears.
S403 Step Three: Information gain denoising. Calculate the information gain of each feature in the irrelevance-filtered feature subset according to the information gain formula, and screen by calculation and comparison to obtain the denoised feature subset;
The information gain formula:
$IG(f_j) = H(Y) - H(Y \mid f_j)$
where $IG(f_j)$ is the information gain that feature $f_j$ contributes to the classification system, $H(Y)$ is the entropy of the classification system, and $H(Y \mid f_j)$ is the conditional entropy of the classification system.
The information gain formula is explained in detail as follows:
Let the probability that a normal software sample occurs be $P(c_0)$ and the probability that a malware sample occurs be $P(c_1)$. The entropy of the classification system is then defined as:
$H(Y) = -\sum_{i=0}^{1} P(c_i) \log P(c_i)$
Given the conditional probability $P(c_i \mid f_j = 1)$ of each class when feature $f_j$ appears, the conditional entropy of the classification system is defined as:
$H(Y \mid f_j = 1) = -\sum_{i=0}^{1} P(c_i \mid f_j = 1) \log P(c_i \mid f_j = 1)$
Likewise, when feature $f_j$ does not appear, the entropy of the classification system is defined as:
$H(Y \mid f_j = 0) = -\sum_{i=0}^{1} P(c_i \mid f_j = 0) \log P(c_i \mid f_j = 0)$
where $P(c_i)$ is the proportion of class $c_i$ samples among all training samples, $P(f_j = 1)$ is the proportion of samples in which feature $f_j$ appears, and $P(f_j = 0)$ is the proportion of samples in which feature $f_j$ does not appear.
The information gain $IG(f_j)$ of feature $f_j$ for the classification system is therefore calculated as:
$IG(f_j) = H(Y) - P(f_j = 1)\,H(Y \mid f_j = 1) - P(f_j = 0)\,H(Y \mid f_j = 0)$
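The information gain of a single binary feature can be computed along these lines. This is a sketch under assumptions: it uses the standard Shannon entropy with a base-2 logarithm (the text does not state the base), class labels encoded as 0 for normal and 1 for malicious, and the hypothetical function names `entropy` and `information_gain`.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def information_gain(column, labels):
    """IG(f_j) = H(Y) - H(Y | f_j) for one binary feature column.

    column: list of 0/1 values, one per sample (feature absent/present).
    labels: list of class labels, one per sample (assumed 0 = normal,
            1 = malicious).
    """
    n = len(labels)
    classes = sorted(set(labels))
    h_y = entropy([labels.count(c) / n for c in classes])

    h_conditional = 0.0
    for value in (0, 1):
        subset = [y for x, y in zip(column, labels) if x == value]
        if not subset:
            continue  # P(f_j = value) is 0, so this term vanishes
        weight = len(subset) / n  # P(f_j = value)
        h_conditional += weight * entropy(
            [subset.count(c) / len(subset) for c in classes]
        )
    return h_y - h_conditional
```

A feature that splits the two classes perfectly reaches the maximum gain $H(Y)$, while a feature distributed identically in both classes has a gain of 0, which is why low-gain features are treated as noise.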
S404 Step Four: $\chi^2$ statistic redundancy removal. According to the $\chi^2$ statistic formula, calculate the CHI value ($\chi^2$ statistic) between each feature in the denoised feature subset and the corresponding feature matrix, as well as the CHI values between features, and screen by calculation and comparison to obtain the de-redundant feature subset;
The $\chi^2$ statistic formula for two features $f_i$ and $f_j$:
$CHI(f_i, f_j) = \xi_{11} + \xi_{12} + \xi_{21} + \xi_{22}$
where $CHI(f_i, f_j)$ is the $\chi^2$ statistic of features $f_i$ and $f_j$; $\xi_{11}$ is the deviation between the theoretical and actual numbers of samples in which $f_i$ and $f_j$ both appear; $\xi_{12}$ is the deviation for samples in which $f_i$ appears and $f_j$ does not; $\xi_{21}$ is the deviation for samples in which $f_i$ does not appear and $f_j$ does; $\xi_{22}$ is the deviation for samples in which neither $f_i$ nor $f_j$ appears.
The $\chi^2$ statistic formula is explained in detail as follows:
The $\chi^2$ statistic uses the deviation between actual and theoretical values to measure the degree of correlation between a feature and a class. Take two features $f_i$ and $f_j$, and let $N_{11}$ be the number of samples in which both appear, $N_{22}$ the number in which neither appears, $N_{12}$ the number in which $f_i$ appears and $f_j$ does not, and $N_{21}$ the number in which $f_i$ does not appear and $f_j$ does. Their relationship is shown in Table 1:
Table 1: Feature distribution table
                         f_j appears    f_j does not appear
f_i appears              N_11           N_12
f_i does not appear      N_21           N_22
where $N$ is the total number of samples, equal to the sum of the four cases, i.e. $N = N_{11} + N_{12} + N_{21} + N_{22}$. It follows that the frequency with which feature $f_i$ appears is:
$P(f_i) = \frac{N_{11} + N_{12}}{N}$
The number of samples in which feature $f_j$ appears is $N_{11} + N_{21}$. In theory, the number of those samples in which feature $f_i$ also appears is:
$E_{11} = (N_{11} + N_{21}) \cdot \frac{N_{11} + N_{12}}{N}$
The deviation $\xi_{11}$ between the theoretical and actual values for $f_i$ and $f_j$ appearing together is then:
$\xi_{11} = \frac{(N_{11} - E_{11})^2}{E_{11}}$
Similarly, one obtains the theoretical number of samples $E_{12}$ in which $f_i$ appears and $f_j$ does not, the theoretical number $E_{21}$ in which $f_i$ does not appear and $f_j$ does, the theoretical number $E_{22}$ in which neither appears, and the deviations $\xi_{12}$, $\xi_{21}$, $\xi_{22}$ between their theoretical and actual values, calculated as follows:
$E_{12} = (N_{12} + N_{22}) \cdot \frac{N_{11} + N_{12}}{N}, \quad \xi_{12} = \frac{(N_{12} - E_{12})^2}{E_{12}}$
$E_{21} = (N_{11} + N_{21}) \cdot \frac{N_{21} + N_{22}}{N}, \quad \xi_{21} = \frac{(N_{21} - E_{21})^2}{E_{21}}$
$E_{22} = (N_{12} + N_{22}) \cdot \frac{N_{21} + N_{22}}{N}, \quad \xi_{22} = \frac{(N_{22} - E_{22})^2}{E_{22}}$
Therefore, the $\chi^2$ statistic of the two features $f_i$ and $f_j$ is the sum of the deviations $\xi_{11}$, $\xi_{12}$, $\xi_{21}$, $\xi_{22}$, i.e.
$CHI(f_i, f_j) = \xi_{11} + \xi_{12} + \xi_{21} + \xi_{22}$
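The sum of the four deviations can be computed directly from two binary columns. The sketch below assumes the usual Pearson form, squared difference between observed and expected count divided by the expected count, for each cell of the 2x2 table; the function name `chi_square` is hypothetical, and skipping cells whose expected count is 0 is a choice made here to avoid division by zero, not something the text specifies.

```python
def chi_square(col_i, col_j):
    """Chi-square statistic of two binary columns from their 2x2 table.

    col_i, col_j: lists of 0/1 values, one entry per sample. The second
    column may be another feature or the class label.
    """
    n = len(col_i)
    # Observed counts keyed by (value of f_i, value of f_j).
    observed = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for a, b in zip(col_i, col_j):
        observed[(a, b)] += 1

    total = 0.0
    for (a, b), count in observed.items():
        row = observed[(a, 0)] + observed[(a, 1)]  # samples with f_i == a
        col = observed[(0, b)] + observed[(1, b)]  # samples with f_j == b
        expected = row * col / n                   # theoretical count
        if expected > 0:  # assumption: empty margins contribute nothing
            total += (count - expected) ** 2 / expected
    return total
```

Two identical columns give the largest possible value for a 2x2 table (the sample count), and two independent columns give 0, so a large CHI value between two features signals that one of them is redundant.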
S405 Step Five: Generate the optimal feature subset. Analyze and evaluate the de-redundant feature subset and act further according to the result to obtain the final optimal feature subset.
Step One is specifically:
Denote the sample set feature matrix as $X_{train}$, the selected feature set as $F_v$, and the number of selected features as $M_v$. Set the initial information gain threshold to a particular value $\theta_{ig}$, the information gain step size to $\lambda$, the information gain loop counter to $n = 0$, and the detection rate threshold to 0.95. Train on the original feature matrix $X_{train}$ using a machine learning classification algorithm and record the maximum detection rate as $TP_{max}$.
As shown in Fig. 5, Step Two specifically comprises:
S501: Calculate the characteristic frequency of each feature in the sample set feature set;
S502: Filter out the features whose characteristic frequency is 0; the remaining features form the intermediate feature subset;
S503: Train on the feature matrix corresponding to the intermediate feature subset using the machine learning classification algorithm to obtain the corresponding detection rate $TP_{tf}$;
S504: Filter out the feature with the smallest characteristic frequency in the intermediate feature subset; the remaining features form a reduced feature subset. Train on the feature matrix corresponding to the reduced feature subset using the machine learning classification algorithm to obtain the corresponding detection rate $TP_{tf}'$;
S505: Compare $TP_{tf}$ with $TP_{tf}'$. If $TP_{tf} = TP_{tf}'$, take the reduced feature subset as the new intermediate feature subset and return to step S503; if $TP_{tf} \neq TP_{tf}'$, output the intermediate feature subset;
S506: Denote the output feature subset as $F_{v1}$ and its number of features as $M_{v1}$; $F_{v1}$ is the irrelevance-filtered feature subset.
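The loop S502 to S506 can be sketched as below. Because the characteristic frequency formula itself is not reproduced in the text, the sketch takes precomputed scores as input. The names `remove_uncorrelated`, `tf_scores`, and `detection_rate` are hypothetical; `detection_rate` stands in for training a classifier on the given features and returning its detection rate. Stopping when a single feature remains is an added safeguard, not part of the described steps.

```python
def remove_uncorrelated(tf_scores, detection_rate):
    """Drop low-frequency features while the detection rate is unchanged.

    tf_scores: dict mapping feature name -> characteristic frequency.
    detection_rate: callable taking a list of features and returning the
        detection rate of a classifier trained on them (hypothetical).
    """
    # S502: discard features whose characteristic frequency is 0.
    current = [f for f, tf in tf_scores.items() if tf != 0]

    while len(current) > 1:  # safeguard (assumption): never empty the subset
        rate = detection_rate(current)                       # S503
        weakest = min(current, key=lambda f: tf_scores[f])   # S504
        candidate = [f for f in current if f != weakest]
        if detection_rate(candidate) != rate:                # S505
            break  # removing the feature changed the rate: keep it
        current = candidate
    return current                                           # S506
```

The exact equality test mirrors the comparison in S505; with real classifiers a small tolerance may be more practical, which would be a deviation from the text.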
As shown in Fig. 6, Step Three specifically comprises:
S601: Calculate the information gain of each feature in the irrelevance-filtered feature subset;
S602: Increment the loop counter by 1, i.e. $n = n + 1$;
S603: Select the features satisfying $IG > \theta_{ig} - (n-1)\lambda$ to form one feature subset and record its number of features; select the features satisfying $IG > \theta_{ig} - n\lambda$ to form a second feature subset and record its number of features;
S604: Compare the two feature counts and, depending on the result, either return to step S602 or output the feature subset;
S605: Denote the output feature subset as $F_{v2}$ and its number of features as $M_{v2}$; $F_{v2}$ is the denoised feature subset.
As shown in Fig. 7, Step Four specifically comprises:
S701: Calculate the CHI value ($\chi^2$ statistic) between each feature in the feature subset $F_{v2}$ and the corresponding feature matrix, and denote the largest of these CHI values as $\theta_{chi}$;
S702: Calculate the CHI values between features and select the feature pairs whose CHI value exceeds $\theta_{chi}$; from each pair pick the feature with the smaller IG value, and arrange the picked features in descending order of CHI value to form the redundant feature set; record the number of redundant features selected;
S703: Set the loop counter $m = 0$;
S704: Increment the loop counter by 1, i.e. $m = m + 1$;
S705: Following the order of the redundant features in the redundant feature set, reject one redundant feature from $F_{v2}$ to obtain a feature subset, and record its number of features. Train on the feature matrix corresponding to this feature subset using the machine learning classification algorithm to obtain the corresponding detection rate;
S706: Compare $m$ with the number of redundant features; if $m$ is smaller, return to step S704; otherwise proceed to the next step;
S707: Compare all the detection rates obtained and record the maximum; the feature subset corresponding to the maximum detection rate is denoted $F_{v3}$, with $M_{v3}$ selected features; $F_{v3}$ is the de-redundant feature subset.
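One possible reading of S701 to S707 is sketched below. It assumes that the rejections in S705 accumulate, so each pass removes one more redundant feature than the last; the text can also be read as removing a single feature per pass. All names (`remove_redundant`, `chi_class`, `chi_pair`, `ig`, `detection_rate`) are hypothetical, the statistics are passed in precomputed, and `detection_rate` stands in for training a classifier.

```python
def remove_redundant(features, chi_class, chi_pair, ig, detection_rate):
    """Return (subset, rate) with the best detection rate after pruning.

    features: list of feature names in the denoised subset.
    chi_class: dict feature -> CHI value against the class.
    chi_pair: dict (feature, feature) -> CHI value between the two.
    ig: dict feature -> information gain.
    detection_rate: callable(list of features) -> float (hypothetical).
    """
    theta_chi = max(chi_class[f] for f in features)            # S701

    # S702: in every strongly associated pair, flag the lower-IG feature.
    flagged = {}
    for (fi, fj), chi in chi_pair.items():
        if chi > theta_chi:
            weaker = fi if ig[fi] < ig[fj] else fj
            flagged[weaker] = max(chi, flagged.get(weaker, 0.0))
    redundant = sorted(flagged, key=flagged.get, reverse=True)

    if not redundant:  # nothing to prune
        return list(features), detection_rate(list(features))

    best_subset, best_rate = None, float("-inf")
    remaining = list(features)
    for feature in redundant:                                  # S703-S706
        # Assumption: rejections accumulate from one pass to the next.
        remaining = [f for f in remaining if f != feature]
        rate = detection_rate(remaining)                       # S705
        if rate > best_rate:                                   # S707
            best_subset, best_rate = remaining, rate
    return best_subset, best_rate
```

Keeping the higher-IG member of each pair preserves the feature that contributes more to classification while removing its near-duplicate.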
Step Five, generating the optimal feature subset, is specifically:
Compare the maximum detection rate obtained in Step Four with the maximum detection rate $TP_{max}$ and assign the larger of the two to $TP_{max}$. Then compare $TP_{max}$ with the initially set detection rate threshold of 0.95: if $TP_{max} < 0.95$, return to Step Three; if $TP_{max} \geq 0.95$, the feature subset corresponding to $TP_{max}$ is the optimal feature subset, denoted $F_v$.
As shown in Figure 8, in other embodiments of the malware detection method provided by the invention, the method of generating the detection model is specifically:
First set a detection rate threshold, then train on the permission and sensitive API features using the naive Bayes algorithm (NB), the support vector machine algorithm (SVM), the decision tree algorithm (DT) and the k-nearest-neighbor classification algorithm (KNN) respectively, and select the optimal detection model for output according to the set detection rate threshold.
As shown in Figure 8, in other embodiments of the malware detection method provided by the invention, the method of selecting the optimal detection model for output according to the set detection rate threshold is specifically:
If the detection rate of the detection model obtained by training is not lower than the threshold, output that detection model;
If the detection rate of the detection model obtained by training is lower than the threshold, change the combination of features and train again to obtain a new detection model, repeating until the threshold requirement is met, then output the detection model that meets the threshold requirement;
If all possible feature combinations have been traversed and no resulting detection model meets the threshold requirement, output the detection model with the highest detection rate found during the traversal.
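The three output rules above can be sketched as follows. This is a minimal sketch, not the patented implementation: `trainers` is a hypothetical mapping from an algorithm name (for example "NB", "SVM", "DT", "KNN") to a callback that trains on a given feature combination and returns a model together with its detection rate, and the order in which combinations and algorithms are tried is an assumption.

```python
from typing import Any, Callable, Dict, Iterable, Tuple

Result = Tuple[str, Tuple[str, ...], Any, float]  # (algorithm, features, model, rate)


def select_detection_model(
        feature_combinations: Iterable[Tuple[str, ...]],
        trainers: Dict[str, Callable[[Tuple[str, ...]], Tuple[Any, float]]],
        threshold: float) -> Result:
    """Return the first model meeting the threshold, else the best one seen."""
    best = None
    for combo in feature_combinations:
        for name, train in trainers.items():
            model, rate = train(combo)
            if rate >= threshold:
                # Rule 1 and 2: a model that meets the threshold is output.
                return name, combo, model, rate
            if best is None or rate > best[3]:
                best = (name, combo, model, rate)
    if best is None:
        raise ValueError("no feature combination was supplied")
    # Rule 3: every combination was tried, so fall back to the highest rate.
    return best
```

Returning the feature combination alongside the model makes it possible to apply the same feature extraction when the model is later used on unseen software.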
Table 2 shows the results of testing the detection performance in one embodiment of the malware detection method provided by the invention.
Table 2: Detection performance results
The specific implementation is as follows:
The sample set consists of 5000 normal software samples that passed inspection on the Anzhi market and 5000 malware samples from VirusShare. Testing uses 10-fold cross-validation: the sample data is divided into 10 mutually exclusive subsets of similar size; each time, the union of 9 subsets is used as the training set and the remaining subset as the test set, so that training and testing are carried out 10 times, and the final result is the mean of these 10 test results.
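The 10-fold cross-validation procedure described above can be sketched with the Python standard library alone. This is an illustrative sketch: the shuffling seed, the function names `k_fold_indices` and `cross_validate`, and the `evaluate` callback (expected to train on the training indices and return a score on the test indices) are assumptions, not part of the original text.

```python
import random
from typing import Callable, Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k mutually exclusive folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding gives k disjoint folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples: int,
                   evaluate: Callable[[List[int], List[int]], float],
                   k: int = 10, seed: int = 0) -> float:
    """Average the score returned by evaluate over all k folds."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

With the 10000-sample set described above, each fold holds 1000 test samples and 9000 training samples.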
The detection performance obtained without a feature selection algorithm is compared with that obtained using the feature selection algorithm based on feature frequency, information gain and the χ2 statistic.
Wherein, the meaning of each performance indicator is as follows:
(1) TPR (detection rate) is the ratio of the positive examples that the classifier classifies correctly to the actual positive examples. The larger the TPR, the better the classifier classifies positive examples. It is calculated as follows:
TPR = TP / (TP + FN)
(2) FPR (false alarm rate) is the ratio of the negative examples that the classifier wrongly classifies as positive to the actual negative examples. The larger the FPR, the worse the classifier classifies negative examples. It is calculated as follows:
FPR = FP / (FP + TN)
(3) Acc (accuracy) is the ratio of all samples that the classifier classifies correctly to the total number of samples and expresses how accurately the classifier classifies overall. The larger the Acc, the better the overall classification ability of the classifier. It is calculated as follows:
Acc = (TP + TN) / (TP + FP + TN + FN)
In these formulas, TP (true positive) is the number of samples that are actually positive and are detected as positive, i.e. correctly detected positive examples; FP (false positive) is the number of samples that are actually negative but are detected as positive, i.e. wrongly detected negative examples; FN (false negative) is the number of samples that are actually positive but are detected as negative, i.e. wrongly classified positive examples; TN (true negative) is the number of samples that are actually negative and are detected as negative, i.e. correctly detected negative examples.
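The three indicators follow directly from the four counts defined above. The short sketch below computes them; the function name `detection_metrics` and the example counts are hypothetical and are not results from the patent.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute TPR, FPR and Acc from confusion-matrix counts."""
    positives = tp + fn   # actual positive examples
    negatives = fp + tn   # actual negative examples
    total = positives + negatives
    return {
        "TPR": tp / positives if positives else 0.0,
        "FPR": fp / negatives if negatives else 0.0,
        "Acc": (tp + tn) / total if total else 0.0,
    }


# Made-up counts for illustration: 100 positive and 100 negative samples.
example = detection_metrics(tp=90, fp=5, fn=10, tn=95)
```

Returning 0.0 when a denominator is zero is a convenience of this sketch; an empty class would normally indicate a problem with the test split.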
The test results show that, after the malware detection method proposed by the invention optimizes the selection of software features with the feature selection algorithm, the detection models obtained by training on those features with four different machine learning classification algorithms all have a higher detection rate and accuracy than the detection models obtained by the traditional malware detection method without a feature selection algorithm, and their false alarm rates are all lower than those of the detection models obtained by the traditional malware detection method.
This shows that the malware detection method proposed by the invention has higher detection efficiency and a better detection effect.
Another aspect of the invention provides a malware detection device.
In some embodiments of the malware detection device provided by the invention, the device includes:
A feature extraction module, for extracting the feature information of the software in the sample set and abstracting the feature information into digital form to obtain the sample set feature set and the sample set feature matrix;
A subset generation module, for filtering the invalid features out of the sample set feature set using a feature selection algorithm to obtain the optimal feature subset;
A detection model generation module, for training the feature matrix corresponding to the optimal feature subset using a machine learning classification algorithm to generate the detection model.
Another aspect of the invention provides a malware detection electronic equipment.
In some embodiments of the malware detection electronic equipment provided by the invention, the electronic equipment includes:
A memory, a processor, and a computer program stored on the memory and executable on the processor.
The processor implements the malware detection method provided by the invention when executing the program.
The device and electronic equipment of the above embodiments are used to implement the corresponding methods of the preceding embodiments and have the beneficial effects of the corresponding method embodiments, which are not repeated here.
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the disclosure (including the claims) is limited to these examples. Within the idea of the invention, the technical features of the above embodiments or of different embodiments may also be combined, the steps may be carried out in any order, and many other variations of the different aspects of the invention as described above exist, which are not provided in detail for the sake of brevity.
In addition, to simplify the description and discussion, and so as not to obscure the invention, well-known power and ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided drawings. Furthermore, devices may be shown in block diagram form to avoid obscuring the invention, which also takes into account the fact that the details of implementing these block diagram devices depend heavily on the platform on which the invention is to be implemented (that is, these details should be well within the understanding of those skilled in the art). Where specific details (for example, circuits) are set forth to describe exemplary embodiments of the invention, it will be apparent to those skilled in the art that the invention can be practiced without these specific details or with variations of them. Accordingly, these descriptions are to be regarded as illustrative rather than restrictive.
Although the invention has been described in conjunction with specific embodiments, many replacements, modifications and variations of these embodiments will be apparent to those of ordinary skill in the art from the foregoing description. For example, the embodiments discussed can be used with other memory architectures (for example, dynamic RAM (DRAM)).
The embodiments of the invention are intended to cover all such replacements, modifications and variations that fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement or improvement made within the spirit and principles of the invention shall be included in the protection scope of the invention.

Claims (10)

1. A malware detection method, characterized by comprising:
Extracting the feature information of the software in a sample set and abstracting the feature information into digital form to obtain a sample set feature set and a sample set feature matrix;
Filtering the invalid features out of the feature set using a feature selection algorithm to obtain an optimal feature subset;
Training the feature matrix corresponding to the optimal feature subset using a machine learning classification algorithm to generate a detection model.
2. The method according to claim 1, characterized in that extracting the feature information of the software in the sample set and abstracting the feature information into digital form to obtain the sample set feature set and the sample set feature matrix comprises:
Processing the software installation packages of the sample set to obtain a global configuration file containing permission information and decompiled files containing API information;
Extracting the corresponding permission and API feature information from the global configuration file and the decompiled files;
Vectorizing the extracted permission and API feature information into digital form to obtain the sample set feature set and the sample set feature matrix.
3. The method according to claim 1, characterized in that filtering the invalid features out of the sample set feature set using the feature selection algorithm to obtain the optimal feature subset comprises:
Step 1: initializing the constant parameters related to the sample set feature set and the sample set feature matrix, and the parameters used in the subset generation process;
Step 2: calculating the feature frequency of each feature in the sample set feature set according to the feature frequency calculation formula, and filtering out the irrelevant features by comparison to obtain the feature subset with irrelevant features removed;
The feature frequency calculation formula:
Wherein, TF(fj) denotes the feature frequency of feature fj; Nbenign denotes the number of normal samples in the normal software set, and the accompanying count is the number of normal samples in which feature fj appears; Nmalware denotes the number of malicious samples in the malicious sample set, and the accompanying count is the number of malicious samples in which feature fj appears;
Step 3: calculating the information gain of each feature in the feature subset with irrelevant features removed according to the information gain calculation formula, and obtaining the denoised feature subset by comparison and screening;
The information gain calculation formula:
IG(fj) = H(Y) - H(Y | fj)
Wherein, IG(fj) denotes the information gain of feature fj for the classification system, H(Y) denotes the entropy of the classification system, and H(Y | fj) denotes the conditional entropy of the classification system;
Step 4: according to the χ2 statistic calculation formula, calculating the CHI value (χ2 statistic) between each feature in the denoised feature subset and the corresponding feature matrix, as well as the CHI value between features, and obtaining the de-redundant feature subset by comparison and screening;
The χ2 statistic calculation formula:
CHI(fi, fj) = ξ11 + ξ12 + ξ21 + ξ22
Wherein, CHI(fi, fj) denotes the χ2 statistic of features fi and fj; ξ11 denotes the deviation between the theoretical value and the actual value of the number of samples in which features fi and fj both appear; ξ12 denotes the deviation between the theoretical value and the actual value of the number of samples in which feature fi appears but feature fj does not; ξ21 denotes the deviation between the theoretical value and the actual value of the number of samples in which feature fi does not appear but feature fj does; ξ22 denotes the deviation between the theoretical value and the actual value of the number of samples in which neither feature fi nor feature fj appears;
Step 5: analyzing and judging the de-redundant feature subset, and optimizing the subset according to the judgment result to obtain the optimal feature subset.
4. The method according to claim 3, characterized in that step 1 is specifically:
Denoting the sample set feature set as Fv, the sample set feature matrix as Xtrain, and the number of selected features as Mv; setting the initial threshold of the information gain to a particular value θig, setting the information gain step size to λ, setting the initial value of the information gain loop step count to n = 0, and setting the detection rate threshold to 0.95; training the sample set feature matrix Xtrain using a machine learning classification algorithm and recording the maximum detection rate as TPmax.
5. The method according to claim 4, characterized in that step 2 comprises:
Step 1: calculating the feature frequency of each feature in the sample set feature set;
Step 2: filtering out the features whose feature frequency is 0, the remaining features forming an intermediate feature subset;
Step 3: training the feature matrix corresponding to the intermediate feature subset with a machine learning classification algorithm to obtain the corresponding detection rate TPtf;
Step 4: filtering out the feature with the smallest feature frequency in the intermediate feature subset, the remaining features forming a new feature subset; training the feature matrix corresponding to this new feature subset with a machine learning classification algorithm to obtain the corresponding detection rate TPtf′;
Step 5: comparing TPtf with TPtf′; if TPtf = TPtf′, taking the new feature subset as the new intermediate feature subset and returning to Step 3; if TPtf ≠ TPtf′, outputting the intermediate feature subset;
Step 6: denoting the output intermediate feature subset as the feature subset Fv1, with Mv1 selected features; the feature subset Fv1 is the feature subset with irrelevant features removed.
6. The method according to claim 5, characterized in that step 3 comprises:
Step 1: calculating the information gain of each feature in the feature subset with irrelevant features removed;
Step 2: increasing the loop step count by 1, i.e. n = n + 1;
Step 3: selecting the features satisfying IG > (θig - (n-1)λ) to form one feature subset and recording its number of selected features; selecting the features satisfying IG > (θig - nλ) to form another feature subset and recording its number of selected features;
Step 4: comparing the two feature counts and, depending on the result, either returning to Step 2 or outputting the feature subset;
Step 5: denoting the output feature subset as the feature subset Fv2, with Mv2 selected features; the feature subset Fv2 is the denoised feature subset.
7. The method according to claim 6, characterized in that step 4 comprises:
Step 1: calculating the CHI value between each feature in the denoised feature subset Fv2 and the corresponding feature matrix, and denoting the largest of these CHI values as θchi;
Step 2: calculating the CHI value between features and selecting the feature pairs whose CHI value is greater than θchi; from each such pair, picking out the feature with the smaller IG value as a redundant feature, and arranging the picked features in descending order of CHI value to form the redundant feature set; recording the number of redundant features selected;
Step 3: setting the loop step count m = 0;
Step 4: increasing the loop step count by 1, i.e. m = m + 1;
Step 5: following the order of the redundant features in the redundant feature set, rejecting one redundant feature from Fv2 to obtain a feature subset and recording its number of selected features; training the feature matrix corresponding to this feature subset with a machine learning classification algorithm to obtain the corresponding detection rate;
Step 6: comparing m with the number of redundant features; if m is smaller than the number of redundant features, returning to Step 4; otherwise, proceeding to the next step;
Step 7: comparing all the detection rates obtained and recording the maximum detection rate; the feature subset corresponding to the maximum detection rate is denoted Fv3, with Mv3 selected features; the feature subset Fv3 is the de-redundant feature subset.
8. The method according to claim 7, characterized in that step 5 is specifically:
Comparing the detection rate obtained in step 4 with the maximum detection rate TPmax, and assigning the larger of the two to TPmax; comparing TPmax with the initially set detection rate threshold of 0.95; if TPmax < 0.95, returning to step 3; if TPmax ≥ 0.95, the feature subset corresponding to TPmax is the optimal feature subset, denoted Fv.
9. A malware detection device, characterized in that the device comprises:
A feature extraction module, for extracting the feature information of the software in a sample set and abstracting the feature information into digital form to obtain a sample set feature set and a sample set feature matrix;
A subset generation module, for filtering the invalid features out of the sample set feature set using a feature selection algorithm to obtain an optimal feature subset;
A detection model generation module, for training the feature matrix corresponding to the optimal feature subset using a machine learning classification algorithm to generate a detection model.
10. An electronic equipment comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1 to 8 when executing the program.
CN201811495637.7A 2018-12-07 2018-12-07 Malicious software detection method and device and electronic equipment Active CN109784046B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811495637.7A CN109784046B (en) 2018-12-07 2018-12-07 Malicious software detection method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN109784046A true CN109784046A (en) 2019-05-21
CN109784046B CN109784046B (en) 2021-02-02

Family

ID=66495778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811495637.7A Active CN109784046B (en) 2018-12-07 2018-12-07 Malicious software detection method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN109784046B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110464345A (en) * 2019-08-22 2019-11-19 北京航空航天大学 A kind of separate head bioelectrical power signal interference elimination method and system
CN110955895A (en) * 2019-11-29 2020-04-03 珠海豹趣科技有限公司 Operation interception method and device and computer readable storage medium
CN110990834A (en) * 2019-11-19 2020-04-10 重庆邮电大学 Static detection method, system and medium for android malicious software
CN112632539A (en) * 2020-12-28 2021-04-09 西北工业大学 Dynamic and static mixed feature extraction method in Android system malicious software detection

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2128798A1 (en) * 2008-05-27 2009-12-02 Deutsche Telekom AG Unknown malcode detection using classifiers with optimal training sets
CN104298715A (en) * 2014-09-16 2015-01-21 北京航空航天大学 TF-IDF based multiple-index result merging and sequencing method
CN105320887A (en) * 2015-10-12 2016-02-10 湖南大学 Static characteristic extraction and selection based detection method for Android malicious application
CN107577942A (en) * 2017-08-22 2018-01-12 中国民航大学 A kind of composite character screening technique for Android malware detection

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110464345A (en) * 2019-08-22 2019-11-19 北京航空航天大学 A kind of separate head bioelectrical power signal interference elimination method and system
CN110990834A (en) * 2019-11-19 2020-04-10 重庆邮电大学 Static detection method, system and medium for android malicious software
CN110990834B (en) * 2019-11-19 2022-12-27 重庆邮电大学 Static detection method, system and medium for android malicious software
CN110955895A (en) * 2019-11-29 2020-04-03 珠海豹趣科技有限公司 Operation interception method and device and computer readable storage medium
CN110955895B (en) * 2019-11-29 2022-03-29 珠海豹趣科技有限公司 Operation interception method and device and computer readable storage medium
CN112632539A (en) * 2020-12-28 2021-04-09 西北工业大学 Dynamic and static mixed feature extraction method in Android system malicious software detection
CN112632539B (en) * 2020-12-28 2024-04-09 西北工业大学 Dynamic and static hybrid feature extraction method in Android system malicious software detection

Also Published As

Publication number Publication date
CN109784046B (en) 2021-02-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant