CN109784046A - Malware detection method, apparatus and electronic device
- Publication number
- CN109784046A (CN201811495637.7A)
- Authority
- CN
- China
- Prior art keywords
- feature
- subset
- characteristic
- value
- sample set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses a malware detection method, apparatus and electronic device, relating to the field of mobile-terminal security. The invention can detect malware effectively and overcomes the redundancy, irrelevance and noise present in the malware features extracted by the prior art. The malware detection method comprises: feature extraction; subset generation; and detection model generation. The malware detection apparatus comprises a feature extraction module, a subset generation module and a detection model generation module. The electronic device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the malware detection method when executing the program.
Description
Technical field
The present invention relates to security protection for mobile terminals, and in particular to a malware detection method, apparatus and electronic device.
Background technique
A mobile intelligent terminal is a general term for any mobile terminal that has network access and is equipped with an operating system and application programs. The wide adoption and powerful functions of mobile intelligent terminals, led by the Android system, have made them an irreplaceable tool in every field of modern society. At the same time, the malware that accompanies mobile intelligent terminals has become increasingly rampant. Without being noticed, malware damages user systems and steals user data and service fees, seriously threatening the privacy and property of users. More seriously, classified information concerning the national economy, politics and the military is also threatened, which endangers national security. To cope with the growing malware threat to mobile intelligent terminals and to meet the future demand for detecting unknown malware on them, a detection method for Android malware is needed.
Existing detection methods based on machine learning approach the problem from the perspective of artificial intelligence: a classification algorithm learns the features of known malware and builds a continuously evolving, generalizable intelligent detection model, so that Android software can be inspected automatically. The key to this kind of method is feature selection. The more effectively the selected features distinguish malware from normal software, the more efficient the detection model obtained with the machine learning classification algorithm, and the better the malware detection. However, the malware features extracted by existing methods suffer from redundancy, irrelevance and noise. Redundant features reduce the computational efficiency of the classification algorithm and the effectiveness of the detection model; irrelevant features mean that more training samples are needed to obtain a suitable detection model; noisy features can directly lead to an incorrect detection model. These problems greatly increase the time and space cost of machine learning, to the point where the classification algorithm becomes unusable because analysing the features is too expensive.
Summary of the invention
In view of this, the object of the invention is to provide a detection method, apparatus and electronic device that meet the requirement of detecting unknown malware on mobile terminals while solving the redundancy, irrelevance and noise problems faced in feature selection.
To this end, the invention provides a malware detection method. The malware detection method comprises:
extracting feature information from the software in a sample set and abstracting the feature information into numerical form to obtain a sample-set feature set and a sample-set feature matrix;
filtering invalid features out of the feature set with a feature selection algorithm to obtain an optimal feature subset;
training on the feature matrix corresponding to the optimal feature subset with a machine learning classification algorithm to generate a detection model.
Optionally, extracting the feature information of the sample-set software and abstracting it into numerical form to obtain the sample-set feature set and the sample-set feature matrix comprises:
processing the software installation packages in the sample set to obtain a global configuration file containing permission information and decompiled files containing API information;
extracting the corresponding permission and API feature information from the global configuration file and the decompiled files;
vectorizing the extracted permission and API feature information into numerical form to obtain the sample-set feature set and the sample-set feature matrix.
Optionally, filtering invalid features out of the sample-set feature set with the feature selection algorithm to obtain the optimal feature subset comprises:
Step one: initialize the constants associated with the sample-set feature set and the sample-set feature matrix, and the parameters used during subset generation.
Step two: compute the feature frequency of every feature in the sample-set feature set according to the feature frequency formula, and eliminate irrelevant features by comparing the computed values to obtain the irrelevance-removed feature subset.
The feature frequency formula: [formula not preserved in the source text]
where $TF(f_j)$ denotes the feature frequency of feature $f_j$; $N_{benign}$ denotes the number of normal samples in the normal software set, and a companion term denotes the number of those samples in which $f_j$ appears; $N_{malware}$ denotes the number of malicious samples in the malicious sample set, and a companion term denotes the number of those samples in which $f_j$ appears.
Step three: compute the information gain of every feature in the irrelevance-removed feature subset according to the information gain formula, and screen by comparing the computed values to obtain the denoised feature subset.
The information gain formula:
$$IG(f_j) = H(Y) - H(Y \mid f_j)$$
where $IG(f_j)$ denotes the information gain of feature $f_j$ for the classification system, $H(Y)$ denotes the entropy of the classification system, and $H(Y \mid f_j)$ denotes the conditional entropy of the classification system.
Step four: according to the χ² statistic formula, compute the CHI value (χ² statistic) between each feature in the denoised feature subset and the corresponding feature matrix, as well as the CHI value between features, and screen by comparing the computed values to obtain the redundancy-removed feature subset.
The χ² statistic formula:
$$CHI(f_i, f_j) = \xi_{11} + \xi_{12} + \xi_{21} + \xi_{22}$$
where $CHI(f_i, f_j)$ denotes the χ² statistic of features $f_i$ and $f_j$; $\xi_{11}$ denotes the deviation between the theoretical and actual number of samples in which $f_i$ and $f_j$ appear together; $\xi_{12}$ denotes the deviation for samples in which $f_i$ appears and $f_j$ does not; $\xi_{21}$ denotes the deviation for samples in which $f_i$ does not appear and $f_j$ does; and $\xi_{22}$ denotes the deviation for samples in which neither $f_i$ nor $f_j$ appears.
Step five: analyse and evaluate the redundancy-removed feature subset, and optimize the subset according to the result to obtain the optimal feature subset.
Optionally, step one is specifically:
Denote the sample-set feature set as $F_v$, the sample-set feature matrix as $X_{train}$, and the number of selected features as $M_v$. Set the initial information gain threshold to a particular value $\theta_{ig}$, the information gain step size to $\lambda$, the initial value of the information gain loop counter to $n = 0$, and the detection rate threshold to 0.95. Train on the sample-set feature matrix $X_{train}$ with the machine learning classification algorithm and record the maximum detection rate as $TP_{max}$.
Optionally, step two comprises:
Step 1: compute the feature frequency of every feature in the sample-set feature set;
Step 2: filter out the features whose feature frequency is 0; the remaining features form an intermediate feature subset;
Step 3: train on the feature matrix corresponding to the intermediate feature subset with the machine learning classification algorithm to obtain the corresponding detection rate $TP_{tf}$;
Step 4: filter out the feature with the smallest feature frequency in the intermediate feature subset; the remaining features form a reduced feature subset. Train on the feature matrix corresponding to the reduced feature subset with the machine learning classification algorithm to obtain the corresponding detection rate $TP_{tf}'$;
Step 5: compare $TP_{tf}$ with $TP_{tf}'$. If $TP_{tf} = TP_{tf}'$, take the reduced feature subset as the new intermediate feature subset and return to Step 3; if $TP_{tf} \neq TP_{tf}'$, output the intermediate feature subset;
Step 6: denote the output intermediate feature subset as $F_{v1}$ and the number of selected features as $M_{v1}$; the feature subset $F_{v1}$ is the irrelevance-removed feature subset.
Optionally, step three comprises:
Step 1: compute the information gain of every feature in the irrelevance-removed feature subset;
Step 2: increment the loop counter by 1, i.e. $n = n + 1$;
Step 3: select the features satisfying $IG > (\theta_{ig} - (n-1)\lambda)$ to form one feature subset and record the number of selected features; select the features satisfying $IG > (\theta_{ig} - n\lambda)$ to form a second feature subset and record the number of selected features;
Step 4: compare the two feature counts; depending on the result, either return to Step 2 or output the selected feature subset;
Step 5: denote the output feature subset as $F_{v2}$ and the number of selected features as $M_{v2}$; the feature subset $F_{v2}$ is the denoised feature subset.
Optionally, step four comprises:
Step 1: compute the CHI value between each feature in the denoised feature subset $F_{v2}$ and the corresponding feature matrix, and denote the maximum of these CHI values as $\theta_{chi}$;
Step 2: compute the CHI value between features; pick out the feature pairs whose CHI value is greater than $\theta_{chi}$, select from each pair the feature with the smaller IG value, and arrange the selected features in descending order of CHI value to form the redundant feature set; record the number of redundant features selected;
Step 3: set the loop counter $m = 0$;
Step 4: increment the loop counter by 1, i.e. $m = m + 1$;
Step 5: following the order of the redundant features in the redundant feature set, remove one redundant feature from $F_{v2}$ to obtain a feature subset, and record the number of selected features. Train on the feature matrix corresponding to this feature subset with the machine learning classification algorithm to obtain the corresponding detection rate;
Step 6: compare $m$ with the number of redundant features; if redundant features remain to be removed, return to Step 4; otherwise proceed to the next step;
Step 7: compare all the detection rates obtained and record the maximum detection rate. The feature subset corresponding to the maximum detection rate is denoted $F_{v3}$ and the number of selected features $M_{v3}$; the feature subset $F_{v3}$ is the redundancy-removed feature subset.
Optionally, step five is specifically:
Compare the maximum detection rate obtained in step four with the maximum detection rate $TP_{max}$ and assign the larger of the two to $TP_{max}$. Compare $TP_{max}$ with the initially set detection rate threshold of 0.95: if $TP_{max} < 0.95$, return to step three; if $TP_{max} \geq 0.95$, the feature subset corresponding to $TP_{max}$ is the optimal feature subset, denoted $F_v$.
Optionally, generating the detection model is specifically:
Set a detection rate threshold; train on the feature matrix corresponding to the optimal feature subset with the Bayesian algorithm, the support vector machine algorithm, the decision tree algorithm and the nearest-neighbour classification algorithm respectively; and select the optimal detection model for output according to the set detection rate threshold.
Selecting the optimal detection model for output according to the set detection rate threshold is specifically:
if the detection rate of the trained detection model is not lower than the threshold, output that detection model;
if the detection rate of the trained detection model is lower than the threshold, change the feature combination and retrain to obtain a new detection model until the threshold requirement is met, then output the detection model that meets it;
if all possible feature combinations have been traversed and the detection rate of the resulting models still cannot meet the threshold requirement, output the detection model with the highest detection rate found during the traversal.
The invention also provides a malware detection apparatus, comprising:
a feature extraction module, configured to extract feature information from the software in a sample set and abstract the feature information into numerical form to obtain a sample-set feature set and a sample-set feature matrix;
a subset generation module, configured to filter invalid features out of the sample-set feature set with a feature selection algorithm to obtain an optimal feature subset;
a detection model generation module, configured to train on the feature matrix corresponding to the optimal feature subset with a machine learning classification algorithm to generate a detection model.
The invention also provides an electronic device for malware detection, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the malware detection method provided by the invention when executing the program.
As can be seen from the above, the malware detection method, apparatus and electronic device provided by the invention extract the permission and sensitive-API features of malware and normal software, optimize the extracted features with a feature selection algorithm, and train on the selected combination of permission and sensitive-API features with a machine learning classification algorithm, so that malware can be detected effectively. The feature selection algorithm is designed around feature frequency, information gain and the χ² statistic: feature frequency filters out the features that are irrelevant to classification, information gain selects the features that have a large influence on classification, and the χ² statistic removes the highly redundant features. The malware detection method provided by the invention can therefore overcome the redundancy, irrelevance and noise present in the malware features extracted by prior-art methods. The feature selection algorithm combines these three methods in a preferred order. Compared with combining the three naively, or with using only one or two of them, it selects features more effectively. Training a machine learning classification algorithm on the resulting feature subset yields a more efficient detection model that detects malware better.
Brief description of the drawings
To explain the embodiments of the invention or the prior-art solutions more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the malware detection method in an embodiment of the invention;
Fig. 2 is a flow block diagram of the malware detection method in an embodiment of the invention;
Fig. 3 is a flow diagram of the feature extraction method in the malware detection method in an embodiment of the invention;
Fig. 4 is a schematic diagram of the subset generation method in the malware detection method in an embodiment of the invention;
Fig. 5 is a flow block diagram of the feature-frequency irrelevance-removal method in the malware detection method in an embodiment of the invention;
Fig. 6 is a flow block diagram of the information-gain denoising method in the malware detection method in an embodiment of the invention;
Fig. 7 is a flow block diagram of the χ²-statistic redundancy-removal method in the malware detection method in an embodiment of the invention;
Fig. 8 is a schematic diagram of the detection model generation method in the malware detection method in an embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to specific embodiments and the drawings.
One aspect of the invention provides a malware detection method.
As shown in Fig. 1 and Fig. 2, in some embodiments of the malware detection method provided by the invention, the malware detection method specifically comprises:
S101: feature extraction. Decompile the software samples by reverse engineering, extract feature information, and abstract the feature information into a sample-set feature set and a sample-set feature matrix in a numerical form that is easy to analyse, which are stored in a database;
S102: subset generation. Filter invalid features out of the sample-set feature set with a feature selection algorithm to obtain an optimal feature subset;
S103: detection model generation. Train on the feature matrix corresponding to the optimal feature subset with a machine learning classification algorithm to generate a dual-feature detection model.
As shown in Fig. 3, in other embodiments of the malware detection method provided by the invention, the feature extraction method specifically comprises:
S301: process the software APK package: decode the AndroidManifest.xml file, which contains the permission information, into a plain-text global configuration file, and decompile the classes.dex file, which contains the API information, into .smali files;
S302: extract the corresponding permission and API feature information from the global configuration file and the .smali files;
S303: vectorize the semantic information of the extracted features into numerical form to obtain the sample-set feature set and the sample-set feature matrix. In the sample-set feature matrix, 1 indicates that a feature appears in a sample and 0 indicates that it does not, so the features of the sample set are described by a single binary matrix in which each row is a sample vector and each column is a feature vector.
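The binary encoding described in S303 can be illustrated with a minimal sketch. It assumes that the permission and API names of each sample have already been extracted into a set of strings; the function name `build_feature_matrix` and the sample data are hypothetical and do not come from the patent.

```python
from typing import Dict, List, Set, Tuple


def build_feature_matrix(
    samples: Dict[str, Set[str]]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (sample_names, feature_names, matrix).

    matrix[i][j] is 1 if sample i contains feature j, otherwise 0.
    Rows are sample vectors and columns are feature vectors.
    """
    sample_names = sorted(samples)
    # The feature set is the union of all features seen in any sample.
    feature_names = sorted(set().union(*samples.values()))
    matrix = [
        [1 if feature in samples[name] else 0 for feature in feature_names]
        for name in sample_names
    ]
    return sample_names, feature_names, matrix


# Hypothetical features extracted from two APKs (illustration only).
extracted = {
    "app_a.apk": {"android.permission.SEND_SMS", "android.permission.INTERNET"},
    "app_b.apk": {
        "android.permission.INTERNET",
        "Landroid/telephony/SmsManager;->sendTextMessage",
    },
}
names, features, X = build_feature_matrix(extracted)
for name, row in zip(names, X):
    print(name, row)
```

Sorting the sample and feature names is only a convenience that makes the row and column order reproducible.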
As shown in Fig. 4, in other embodiments of the malware detection method provided by the invention, the subset generation method specifically comprises:
S401, step one, constant initialization: initialize the constants associated with the sample-set feature set and the sample-set feature matrix, and the parameters used during subset generation.
S402, step two, irrelevance removal by feature frequency: compute the feature frequency of every feature in the sample-set feature set according to the feature frequency formula, and filter out irrelevant features by comparing the computed values to obtain the irrelevance-removed feature subset.
The feature frequency formula: [formula not preserved in the source text]
where $TF(f_j)$ denotes the feature frequency of feature $f_j$; $N_{benign}$ denotes the number of normal samples in the normal software set, and a companion term denotes the number of those samples in which $f_j$ appears; $N_{malware}$ denotes the number of malicious samples in the malicious sample set, and a companion term denotes the number of those samples in which $f_j$ appears.
S403, step three, denoising by information gain: compute the information gain of every feature in the irrelevance-removed feature subset according to the information gain formula, and screen by comparing the computed values to obtain the denoised feature subset.
The information gain formula:
$$IG(f_j) = H(Y) - H(Y \mid f_j)$$
where $IG(f_j)$ denotes the information gain of feature $f_j$ for the classification system, $H(Y)$ denotes the entropy of the classification system, and $H(Y \mid f_j)$ denotes the conditional entropy of the classification system.
The information gain formula is explained as follows.
Let the probability that a normal software sample occurs be $P(c_0)$ and the probability that a malware sample occurs be $P(c_1)$. The entropy of the classification system is then defined as:
$$H(Y) = -\sum_{i=0}^{1} P(c_i) \log P(c_i)$$
Given the conditional probability $P(c_i \mid f_j = 1)$ of each class when feature $f_j$ appears, the conditional entropy of the classification system is defined as:
$$H(Y \mid f_j = 1) = -\sum_{i=0}^{1} P(c_i \mid f_j = 1) \log P(c_i \mid f_j = 1)$$
Likewise, when feature $f_j$ does not appear, the entropy of the classification system is defined as:
$$H(Y \mid f_j = 0) = -\sum_{i=0}^{1} P(c_i \mid f_j = 0) \log P(c_i \mid f_j = 0)$$
Here the probability $P(c_i)$ is the proportion of samples of class $c_i$ in the total number of training samples; the probability $P(f_j = 1)$ is the proportion of samples in which feature $f_j$ appears; and the probability $P(f_j = 0)$ is the proportion of samples in which feature $f_j$ does not appear.
The information gain $IG(f_j)$ of feature $f_j$ for the classification system is therefore computed as:
$$IG(f_j) = H(Y) - P(f_j = 1)\, H(Y \mid f_j = 1) - P(f_j = 0)\, H(Y \mid f_j = 0)$$
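The information gain computation above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation used in the patent: the logarithm is taken to base 2, labels are 0 for normal software and 1 for malware, and the function names `entropy` and `information_gain` are chosen here for readability.

```python
import math
from typing import List


def entropy(labels: List[int]) -> float:
    """Entropy of a list of class labels (base-2 logarithm assumed)."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h


def information_gain(feature_column: List[int], labels: List[int]) -> float:
    """IG(f) = H(Y) - P(f=1) H(Y | f=1) - P(f=0) H(Y | f=0)."""
    n = len(labels)
    present = [y for x, y in zip(feature_column, labels) if x == 1]
    absent = [y for x, y in zip(feature_column, labels) if x == 0]
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


labels = [0, 0, 1, 1]                             # two normal, two malicious
print(information_gain([0, 0, 1, 1], labels))     # feature separates the classes
print(information_gain([0, 1, 0, 1], labels))     # feature carries no information
```

A feature that appears only in malicious samples reaches the maximum gain for a balanced two-class set, while a feature distributed identically across both classes has a gain of zero.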
S404, step four, redundancy removal by the χ² statistic: according to the χ² statistic formula, compute the CHI value (χ² statistic) between each feature in the denoised feature subset and the corresponding feature matrix, as well as the CHI value between features. Screen by comparing the computed values to obtain the redundancy-removed feature subset.
The χ² statistic formula for two features $f_i$ and $f_j$:
$$CHI(f_i, f_j) = \xi_{11} + \xi_{12} + \xi_{21} + \xi_{22}$$
where $CHI(f_i, f_j)$ denotes the χ² statistic of features $f_i$ and $f_j$; $\xi_{11}$ denotes the deviation between the theoretical and actual number of samples in which $f_i$ and $f_j$ appear together; $\xi_{12}$ denotes the deviation for samples in which $f_i$ appears and $f_j$ does not; $\xi_{21}$ denotes the deviation for samples in which $f_i$ does not appear and $f_j$ does; and $\xi_{22}$ denotes the deviation for samples in which neither $f_i$ nor $f_j$ appears.
The χ² statistic formula is explained as follows.
The χ² statistic uses the deviation between actual and theoretical values to measure the degree of correlation between a feature and a class. Consider two features $f_i$ and $f_j$. Let $N_{11}$ be the number of samples in which both features appear, $N_{22}$ the number in which neither appears, $N_{12}$ the number in which $f_i$ appears and $f_j$ does not, and $N_{21}$ the number in which $f_i$ does not appear and $f_j$ does. Their relationship is shown in Table 1:
Table 1. Feature distribution
                        f_j appears    f_j does not appear
f_i appears             N_11           N_12
f_i does not appear     N_21           N_22
where $N$ is the total number of samples, equal to the sum of the four cases, i.e. $N = N_{11} + N_{12} + N_{21} + N_{22}$. From this, the frequency with which feature $f_i$ appears is:
$$P(f_i) = \frac{N_{11} + N_{12}}{N}$$
The number of samples in which feature $f_j$ appears is $N_{11} + N_{21}$. Theoretically, among the samples in which $f_j$ appears, the number in which $f_i$ also appears is:
$$E_{11} = (N_{11} + N_{21}) \cdot \frac{N_{11} + N_{12}}{N}$$
The deviation $\xi_{11}$ between the theoretical and actual values for $f_i$ and $f_j$ appearing together is then:
$$\xi_{11} = \frac{(N_{11} - E_{11})^2}{E_{11}}$$
Similarly, one obtains the theoretical number of samples $E_{12}$ in which $f_i$ appears and $f_j$ does not, $E_{21}$ in which $f_i$ does not appear and $f_j$ does, and $E_{22}$ in which neither appears, together with the corresponding deviations $\xi_{12}$, $\xi_{21}$ and $\xi_{22}$:
$$E_{12} = (N_{12} + N_{22}) \cdot \frac{N_{11} + N_{12}}{N}, \qquad \xi_{12} = \frac{(N_{12} - E_{12})^2}{E_{12}}$$
$$E_{21} = (N_{11} + N_{21}) \cdot \frac{N_{21} + N_{22}}{N}, \qquad \xi_{21} = \frac{(N_{21} - E_{21})^2}{E_{21}}$$
$$E_{22} = (N_{12} + N_{22}) \cdot \frac{N_{21} + N_{22}}{N}, \qquad \xi_{22} = \frac{(N_{22} - E_{22})^2}{E_{22}}$$
Therefore, the χ² statistic of the two features $f_i$ and $f_j$ is the sum of the deviations $\xi_{11}$, $\xi_{12}$, $\xi_{21}$ and $\xi_{22}$, i.e.
$$CHI(f_i, f_j) = \xi_{11} + \xi_{12} + \xi_{21} + \xi_{22}$$
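As an illustration of the χ² computation, the sketch below counts the four cases of the 2×2 contingency table for two binary feature columns and sums the four deviations. The function name `chi_square` is illustrative, and skipping cells whose theoretical count is zero is an assumption made here to avoid division by zero; the source does not say how that case is handled.

```python
from typing import List


def chi_square(col_i: List[int], col_j: List[int]) -> float:
    """Chi-square statistic of two binary feature columns."""
    n11 = n12 = n21 = n22 = 0
    for a, b in zip(col_i, col_j):
        if a and b:
            n11 += 1          # both features appear
        elif a:
            n12 += 1          # f_i appears, f_j does not
        elif b:
            n21 += 1          # f_j appears, f_i does not
        else:
            n22 += 1          # neither feature appears
    n = n11 + n12 + n21 + n22
    if n == 0:
        return 0.0
    row_totals = (n11 + n12, n21 + n22)   # f_i appears / does not appear
    col_totals = (n11 + n21, n12 + n22)   # f_j appears / does not appear
    observed = ((n11, n12), (n21, n22))
    chi = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / n
            if expected > 0:  # assumption: empty cells contribute nothing
                chi += (observed[r][c] - expected) ** 2 / expected
    return chi


print(chi_square([1, 1, 0, 0], [1, 1, 0, 0]))   # identical columns
print(chi_square([1, 1, 0, 0], [1, 0, 1, 0]))   # independent columns
```

Two features that always co-occur produce a large value, which is what marks one of them as a redundancy candidate, while independent features produce a value of zero.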
S405, step five, generation of the optimal feature subset: analyse and evaluate the redundancy-removed feature subset, and carry out further operations according to the result to obtain the final optimal feature subset.
Step one is specifically:
Denote the sample-set feature matrix as $X_{train}$, the selected feature set as $F_v$, and the number of selected features as $M_v$. Set the initial information gain threshold to a particular value $\theta_{ig}$, the information gain step size to $\lambda$, the information gain loop counter to $n = 0$, and the detection rate threshold to 0.95. Train on the original feature matrix $X_{train}$ with the machine learning classification algorithm and record the maximum detection rate as $TP_{max}$.
As shown in Fig. 5, step two specifically comprises:
S501: compute the feature frequency of every feature in the sample-set feature set;
S502: filter out the features whose feature frequency is 0; the remaining features form an intermediate feature subset;
S503: train on the feature matrix corresponding to the intermediate feature subset with the machine learning classification algorithm to obtain the corresponding detection rate $TP_{tf}$;
S504: filter out the feature with the smallest feature frequency in the intermediate feature subset; the remaining features form a reduced feature subset. Train on the feature matrix corresponding to the reduced feature subset with the machine learning classification algorithm to obtain the corresponding detection rate $TP_{tf}'$;
S505: compare $TP_{tf}$ with $TP_{tf}'$. If $TP_{tf} = TP_{tf}'$, take the reduced feature subset as the new intermediate feature subset and return to step S503; if $TP_{tf} \neq TP_{tf}'$, output the intermediate feature subset;
S506: denote the output feature subset as $F_{v1}$ and the number of selected features as $M_{v1}$; the feature subset $F_{v1}$ is the irrelevance-removed feature subset.
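The loop in S501 to S506 can be sketched as follows. The sketch assumes that the feature frequencies have already been computed and are passed in as the dictionary `tf`, and it uses a hypothetical callback `detection_rate` to stand in for training a classifier on a feature subset and measuring its detection rate. Both names, and the guard that stops when a single feature is left, are assumptions made for the illustration.

```python
from typing import Callable, Dict, List


def remove_irrelevant(
    tf: Dict[str, float],
    detection_rate: Callable[[List[str]], float],
) -> List[str]:
    """Drop low-frequency features while the detection rate is unchanged."""
    # S502: discard features whose feature frequency is 0.
    subset = [f for f, value in tf.items() if value != 0]
    while len(subset) > 1:                      # guard added for this sketch
        tp = detection_rate(subset)             # S503
        lowest = min(subset, key=lambda f: tf[f])
        candidate = [f for f in subset if f != lowest]   # S504
        tp_candidate = detection_rate(candidate)
        if tp_candidate != tp:                  # S505: rate changed, stop here
            break
        subset = candidate                      # rate unchanged, keep removing
    return subset                               # S506: irrelevance-removed subset


# Toy example: the detection rate only depends on features "c" and "d".
toy_tf = {"a": 0.0, "b": 0.1, "c": 0.5, "d": 0.9}
toy_rate = lambda feats: 0.9 if "c" in feats and "d" in feats else 0.5
print(remove_irrelevant(toy_tf, toy_rate))
```

In practice the equality test on detection rates would probably need a tolerance, since two training runs rarely produce exactly the same floating-point value; the strict comparison here simply mirrors the wording of S505.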
As shown in Fig. 6, step three specifically comprises:
S601: compute the information gain of every feature in the irrelevance-removed feature subset;
S602: increment the loop counter by 1, i.e. $n = n + 1$;
S603: select the features satisfying $IG > (\theta_{ig} - (n-1)\lambda)$ to form one feature subset and record the number of selected features; select the features satisfying $IG > (\theta_{ig} - n\lambda)$ to form a second feature subset and record the number of selected features;
S604: compare the two feature counts; depending on the result, either return to step S602 or output the selected feature subset;
S605: denote the output feature subset as $F_{v2}$ and the number of selected features as $M_{v2}$; the feature subset $F_{v2}$ is the denoised feature subset.
As shown in Fig. 7, step four specifically comprises:
S701: compute the CHI value (χ² statistic) between each feature in the feature subset $F_{v2}$ and the corresponding feature matrix, and denote the maximum of these CHI values as $\theta_{chi}$;
S702: compute the CHI value between features; pick out the feature pairs whose CHI value is greater than $\theta_{chi}$, select from each pair the feature with the smaller IG value, and arrange the selected features in descending order of CHI value to form the redundant feature set; record the number of redundant features selected;
S703: set the loop counter $m = 0$;
S704: increment the loop counter by 1, i.e. $m = m + 1$;
S705: following the order of the redundant features in the redundant feature set, remove one redundant feature from $F_{v2}$ to obtain a feature subset, and record the number of selected features. Train on the feature matrix corresponding to this feature subset with the machine learning classification algorithm to obtain the corresponding detection rate;
S706: compare $m$ with the number of redundant features; if redundant features remain to be removed, return to step S704; otherwise proceed to the next step;
S707: compare all the detection rates obtained and record the maximum detection rate. The feature subset corresponding to the maximum detection rate is denoted $F_{v3}$ and the number of selected features $M_{v3}$; the feature subset $F_{v3}$ is the redundancy-removed feature subset.
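One possible reading of S701 to S707 is sketched below. The text does not state whether each iteration removes one more redundant feature from the running subset or removes a single feature from the original subset $F_{v2}$; the sketch assumes cumulative removal. The IG values, the pairwise CHI values and the threshold are passed in precomputed, `detection_rate` is a hypothetical callback for training and evaluating a classifier, and returning the unchanged subset when no redundant feature is found is a further assumption.

```python
from typing import Callable, Dict, List, Tuple


def remove_redundant(
    features: List[str],
    ig: Dict[str, float],
    pair_chi: Dict[Tuple[str, str], float],
    theta_chi: float,
    detection_rate: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Return the feature subset with the highest detection rate and that rate."""
    # S702: from each pair with CHI > theta_chi take the feature with the
    # smaller IG value, ordered by descending CHI value.
    redundant: List[str] = []
    ordered = sorted(pair_chi.items(), key=lambda kv: kv[1], reverse=True)
    for (fi, fj), chi in ordered:
        if chi > theta_chi:
            weaker = fi if ig[fi] < ig[fj] else fj
            if weaker not in redundant:
                redundant.append(weaker)
    if not redundant:                      # assumption: nothing to remove
        return list(features), detection_rate(list(features))
    best_subset: List[str] = []
    best_rate = float("-inf")
    current = list(features)
    for feature in redundant:              # S703 to S706
        current = [f for f in current if f != feature]
        rate = detection_rate(current)
        if rate > best_rate:               # S707: keep the best subset so far
            best_subset, best_rate = current, rate
    return best_subset, best_rate


toy_ig = {"a": 0.5, "b": 0.3, "c": 0.4}
toy_chi = {("a", "b"): 10.0, ("a", "c"): 2.0, ("b", "c"): 6.0}
toy_rate = lambda feats: 0.97 if feats == ["a", "c"] else 0.90
print(remove_redundant(["a", "b", "c"], toy_ig, toy_chi, 5.0, toy_rate))
```

In the toy data, feature "b" is the weaker member of both highly correlated pairs, so it is the only feature removed.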
The method of generating the optimal feature subset in step five is specifically:
Compare the maximum detection rate obtained in step four with the maximum detection rate $TP_{max}$ and assign the larger of the two to $TP_{max}$. Compare $TP_{max}$ with the initially set detection rate threshold of 0.95: if $TP_{max} < 0.95$, return to step three; if $TP_{max} \geq 0.95$, the feature subset corresponding to $TP_{max}$ is the optimal feature subset, denoted $F_v$.
As shown in Fig. 8, in other embodiments of the malware detection method provided by the invention, the detection model generation method is specifically:
First set a detection rate threshold; then train on the permission and sensitive-API features with the Bayesian algorithm (NB), the support vector machine algorithm (SVM), the decision tree algorithm (DT) and the nearest-neighbour classification algorithm (KNN) respectively, and select the optimal detection model for output according to the set detection rate threshold.
As shown in Fig. 8, in other embodiments of the malware detection method provided by the invention, the method of selecting the optimal detection model for output according to the set detection rate threshold is specifically:
if the detection rate of the trained detection model is not lower than the threshold, output that detection model;
if the detection rate of the trained detection model is lower than the threshold, change the feature combination and retrain to obtain a new detection model until the threshold requirement is met, then output the detection model that meets it;
if all possible feature combinations have been traversed and the detection rate of the resulting models still cannot meet the threshold requirement, output the detection model with the highest detection rate found during the traversal.
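The selection rule above can be sketched as a small search loop. The sketch assumes that each classifier is represented by a hypothetical evaluation callback that trains on a feature combination and returns its detection rate; the names `select_model`, `trainers` and `feature_combinations`, and all the rates in the example, are illustrative and not results from the patent. Taking the best of the four algorithms for each combination before testing the threshold is one possible reading of the text.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Result = Tuple[float, str, List[str]]   # (detection rate, algorithm, features)


def select_model(
    trainers: Dict[str, Callable[[List[str]], float]],
    feature_combinations: Iterable[List[str]],
    threshold: float,
) -> Optional[Result]:
    """Return the first result meeting the threshold, else the best one seen."""
    best: Optional[Result] = None
    for features in feature_combinations:
        for name, evaluate in trainers.items():
            rate = evaluate(features)           # train and measure detection rate
            if best is None or rate > best[0]:
                best = (rate, name, list(features))
        if best is not None and best[0] >= threshold:
            return best                         # threshold met: output this model
    return best                                 # fallback: highest rate found


# Made-up detection rates keyed by (algorithm, number of features).
toy_rates = {("NB", 1): 0.91, ("DT", 1): 0.93, ("NB", 3): 0.94, ("DT", 3): 0.96}
toy_trainers = {
    "NB": lambda feats: toy_rates[("NB", len(feats))],
    "DT": lambda feats: toy_rates[("DT", len(feats))],
}
combos = [["p1"], ["p1", "a1", "a2"]]
print(select_model(toy_trainers, combos, 0.95))
print(select_model(toy_trainers, combos, 0.99))
```

With a threshold of 0.95 the search stops at the second combination; with a threshold of 0.99 no model qualifies and the best model found during the traversal is returned instead.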
Table 2 shows the results of a detection performance test in one embodiment of the malware detection method provided by the invention.
Table 2. Detection performance results
The specific implementation method is as follows:
5000 benign applications from the Anzhi market that have passed detection and 5000 malware samples from VirusShare are used as the sample set, and the experiments use 10-fold cross-validation (10-fold cross-validation divides the sample data into 10 mutually exclusive subsets of similar size; each time, the union of 9 subsets is used as the training set and the remaining subset as the test set, so that training and testing are carried out 10 times, and the final result is the mean of the 10 test results).
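The 10-fold procedure described in the parentheses can be sketched with the Python standard library alone. The helper names `k_fold_indices` and `cross_validate`, the shuffling seed and the `evaluate` callback are assumptions made for this illustration; the patent does not say how the folds are drawn.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=10, seed=0):
    """Split sample indices into k mutually exclusive folds of similar size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffling is an assumption
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average evaluate(train_idx, test_idx) over the k train/test splits."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # The union of the other k-1 folds forms the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return mean(scores)


folds = k_fold_indices(10000)
print(len(folds), {len(f) for f in folds})
```

With the 10,000-sample set described above, each fold holds 1,000 samples, so every round trains on 9,000 samples and tests on the remaining 1,000.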
The detection performance obtained without a feature selection algorithm is compared and analyzed against the detection performance obtained with the feature selection algorithms based on feature frequency, information gain and the χ² statistic, respectively.
The performance indicators are defined as follows:
(1) TPR (detection rate) is the ratio of the positive examples that the classifier finally classifies correctly to the actual positive examples. The larger the TPR, the better the classifier classifies positive examples. It is calculated as:
$$TPR = \frac{TP}{TP + FN}$$
(2) FPR (false alarm rate) is the ratio of the examples that the classifier wrongly classifies as positive to the actual negative examples. The larger the FPR, the worse the classifier classifies negative examples. It is calculated as:
$$FPR = \frac{FP}{FP + TN}$$
(3) Acc (accuracy) is the ratio of all samples that the classifier finally classifies correctly to the total number of samples, and expresses how accurately the classifier classifies. The larger the Acc, the better the overall classification ability of the classifier. It is calculated as:
$$Acc = \frac{TP + TN}{TP + FP + FN + TN}$$
In these formulas, TP (true positives) is the number of samples that are actually positive and are detected as positive, i.e. correctly detected positive examples; FP (false positives) is the number of samples that are actually negative but are detected as positive, i.e. wrongly detected negative examples; FN (false negatives) is the number of samples that are actually positive but are detected as negative, i.e. wrongly classified positive examples; TN (true negatives) is the number of samples that are actually negative and are detected as negative, i.e. correctly detected negative examples.
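As a minimal sketch, the three indicators can be computed from the four confusion-matrix counts using the usual definitions TPR = TP / (TP + FN), FPR = FP / (FP + TN) and Acc = (TP + TN) / (TP + FP + FN + TN). The function name `detection_metrics` and the counts in the example are made up for illustration and are not results from the patent; the sketch also assumes that both classes are present, so that no denominator is zero.

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute TPR, FPR and Acc from confusion-matrix counts.

    Assumes at least one actual positive and one actual negative sample.
    """
    return {
        "TPR": tp / (tp + fn),                   # detection rate
        "FPR": fp / (fp + tn),                   # false alarm rate
        "Acc": (tp + tn) / (tp + fp + fn + tn),  # overall accuracy
    }


# Illustrative counts only: 1000 malware samples and 1000 benign samples.
print(detection_metrics(tp=950, fp=30, fn=50, tn=970))
```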
The test results show that, after the malware detection method proposed by the invention uses the feature selection algorithm to optimize the selection of software features, the detection models obtained by training on the software features with four different machine learning classification algorithms have higher detection rates and accuracy than the detection models obtained by the traditional malware detection method without feature selection, and lower false alarm rates than the detection models obtained by the traditional malware detection method.
This shows that the malware detection method proposed by the invention achieves higher detection efficiency and a better detection effect.
Another aspect of the invention provides a malware detection device.
In some embodiments of the malware detection device provided by the invention, the device includes:
A feature extraction module, configured to extract the feature information of the software in the sample set and abstract the feature information into numerical form, obtaining a sample-set feature set and a sample-set feature matrix;
A subset generation module, configured to filter out the invalid features in the sample-set feature set using a feature selection algorithm, obtaining an optimal feature subset;
A detection model generation module, configured to train on the feature matrix corresponding to the optimal feature subset using a machine learning classification algorithm, generating a detection model.
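To make the output of the feature extraction module concrete, the following sketch turns per-sample permission and API names into a feature set and a 0/1 feature matrix. The binary "feature present or absent" encoding and the function name `build_feature_matrix` are assumptions made for illustration; the text only states that the feature information is abstracted into numerical form.

```python
def build_feature_matrix(samples):
    """Build (feature_set, matrix) from a list of per-sample feature-name sets.

    matrix[i][j] is 1 if sample i contains feature_set[j], else 0
    (binary encoding assumed for this sketch).
    """
    feature_set = sorted(set().union(*samples)) if samples else []
    column = {name: j for j, name in enumerate(feature_set)}
    matrix = []
    for features in samples:
        row = [0] * len(feature_set)
        for name in features:
            row[column[name]] = 1
        matrix.append(row)
    return feature_set, matrix


# Two illustrative samples described by permissions and one API name.
samples = [
    {"android.permission.INTERNET", "android.permission.SEND_SMS"},
    {"android.permission.INTERNET", "TelephonyManager.getDeviceId"},
]
feature_set, matrix = build_feature_matrix(samples)
print(feature_set)
print(matrix)
```

The subset generation module would then operate on the columns of this matrix, and the detection model generation module on the columns that survive feature selection.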
Another aspect of the invention provides an electronic device for malware detection.
In some embodiments of the electronic device for malware detection provided by the invention, the electronic device includes:
A memory, a processor, and a computer program stored in the memory and executable on the processor.
The processor implements the malware detection method provided by the invention when executing the program.
The device and the electronic device of the above embodiments are used to implement the corresponding method of the foregoing embodiments and have the beneficial effects of the corresponding method embodiments, which are not repeated here.
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the disclosure (including the claims) is limited to these examples. Within the idea of the invention, the technical features of the above embodiments or of different embodiments may also be combined, the steps may be carried out in any order, and there are many other variations of the different aspects of the invention as described above which, for brevity, are not provided in detail.
In addition, to simplify the explanation and discussion, and so as not to obscure the invention, well-known power and ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided drawings. Furthermore, devices may be shown in block diagram form to avoid obscuring the invention, which also takes into account the fact that the details of the implementation of such block diagram devices are highly dependent on the platform on which the invention is to be implemented (that is, such details should be well within the understanding of those skilled in the art). Where specific details (for example, circuits) are set forth to describe exemplary embodiments of the invention, it will be apparent to those skilled in the art that the invention can be practiced without these specific details or with variations of them. These descriptions should therefore be regarded as illustrative rather than restrictive.
Although the invention has been described in conjunction with specific embodiments, many alternatives, modifications and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (for example, dynamic RAM (DRAM)) may use the embodiments discussed.
The embodiments of the invention are intended to cover all such alternatives, modifications and variations that fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement or improvement made within the spirit and principles of the invention shall be included in the protection scope of the invention.
Claims (10)
1. A malware detection method, characterized by comprising:
extracting the feature information of the software in a sample set and abstracting the feature information into numerical form, obtaining a sample-set feature set and a sample-set feature matrix;
filtering out the invalid features in the feature set using a feature selection algorithm, obtaining an optimal feature subset;
training on the feature matrix corresponding to the optimal feature subset using a machine learning classification algorithm, generating a detection model.
2. The method according to claim 1, characterized in that extracting the feature information of the software in the sample set and abstracting the feature information into numerical form, obtaining the sample-set feature set and the sample-set feature matrix, comprises:
processing the software installation packages of the sample set to obtain a global configuration file containing permission information and decompiled files containing API information;
extracting the corresponding permission and API feature information from the global configuration file and the decompiled files;
vectorizing the extracted permission and API feature information and abstracting it into numerical form, obtaining the sample-set feature set and the sample-set feature matrix.
3. The method according to claim 1, characterized in that filtering out the invalid features in the sample-set feature set using a feature selection algorithm, obtaining the optimal feature subset, comprises:
Step one: initializing the constant parameters related to the sample-set feature set and the sample-set feature matrix and the parameters used during subset generation;
Step two: calculating the feature frequency of each feature in the sample-set feature set according to the feature frequency formula, and filtering out irrelevant features by calculation and comparison, obtaining an irrelevance-filtered feature subset;
The feature frequency formula:
where $TF(f_j)$ denotes the feature frequency of feature $f_j$; $N_{benign}$ denotes the number of normal samples in the normal software set, and the accompanying count term denotes the number of samples in which feature $f_j$ appears; $N_{malware}$ denotes the number of malicious samples in the malicious sample set, and the accompanying count term denotes the number of samples in which feature $f_j$ appears;
Step three: calculating the information gain of each feature in the irrelevance-filtered feature subset according to the information gain formula, and obtaining a denoised feature subset by calculation, comparison and screening;
The information gain formula:
$$IG(f_j) = H(Y) - H(Y \mid f_j)$$
where $IG(f_j)$ denotes the information gain of feature $f_j$ for the classification system, $H(Y)$ denotes the entropy of the classification system, and $H(Y \mid f_j)$ denotes the conditional entropy of the classification system;
Step four: calculating, according to the χ² statistic formula, the CHI value (χ² statistic) between each feature in the denoised feature subset and the corresponding feature matrix, and the CHI value between features, and obtaining a de-redundant feature subset by calculation, comparison and screening;
The χ² statistic formula:
$$CHI(f_i, f_j) = \xi_{11} + \xi_{12} + \xi_{21} + \xi_{22}$$
where $CHI(f_i, f_j)$ denotes the χ² statistic of features $f_i$ and $f_j$; $\xi_{11}$ denotes the deviation between the theoretical value and the actual value for samples in which $f_i$ and $f_j$ both appear; $\xi_{12}$ denotes the deviation between the theoretical value and the actual value for samples in which $f_i$ appears but $f_j$ does not; $\xi_{21}$ denotes the deviation between the theoretical value and the actual value for samples in which $f_i$ does not appear but $f_j$ does; $\xi_{22}$ denotes the deviation between the theoretical value and the actual value for samples in which neither $f_i$ nor $f_j$ appears;
Step five: analyzing and judging the de-redundant feature subset, and optimizing the subset according to the judgment result, obtaining the optimal feature subset.
4. The method according to claim 3, characterized in that step one is specifically:
denoting the sample-set feature set as $F_v$, the sample-set feature matrix as $X_{train}$ and the number of selected features as $M_v$; setting the initial threshold of the information gain to a specific value $\theta_{ig}$, the information gain step size to $\lambda$, the initial value of the information gain loop counter to $n = 0$, and the detection rate threshold to 0.95; training on the sample-set feature matrix $X_{train}$ using a machine learning classification algorithm and recording the maximum detection rate as $TP_{max}$.
5. The method according to claim 4, characterized in that step two comprises:
Step 1: calculating the feature frequency of each feature in the sample-set feature set;
Step 2: filtering out the features whose feature frequency is 0, the remaining features forming an intermediate feature subset;
Step 3: training on the feature matrix corresponding to the intermediate feature subset using a machine learning classification algorithm, obtaining the corresponding detection rate $TP_{tf}$;
Step 4: filtering out the feature with the smallest feature frequency in the intermediate feature subset, the remaining features forming a feature subset; training on the feature matrix corresponding to this feature subset using a machine learning classification algorithm, obtaining the corresponding detection rate $TP_{tf}'$;
Step 5: comparing the values of $TP_{tf}$ and $TP_{tf}'$: if $TP_{tf} = TP_{tf}'$, taking this feature subset as the new intermediate feature subset and returning to Step 3; if $TP_{tf} \neq TP_{tf}'$, outputting the intermediate feature subset;
Step 6: denoting the intermediate feature subset as feature subset $F_{v1}$, with $M_{v1}$ selected features; the feature subset $F_{v1}$ is the irrelevance-filtered feature subset.
6. The method according to claim 5, characterized in that step three comprises:
Step 1: calculating the information gain of each feature in the irrelevance-filtered feature subset;
Step 2: incrementing the loop counter by 1, i.e. $n = n + 1$;
Step 3: selecting the features that satisfy $IG > (\theta_{ig} - (n-1)\lambda)$ to form a first feature subset and recording the number of features selected; selecting the features that satisfy $IG > (\theta_{ig} - n\lambda)$ to form a second feature subset and recording the number of features selected;
Step 4: comparing the values of the two feature counts and, according to the result of the comparison, either returning to Step 2 or outputting the feature subset;
Step 5: denoting the output feature subset as feature subset $F_{v2}$, with $M_{v2}$ selected features; the feature subset $F_{v2}$ is the denoised feature subset.
7. The method according to claim 6, characterized in that step four comprises:
Step 1: calculating the CHI value between each feature in the denoised feature subset $F_{v2}$ and the corresponding feature matrix, and denoting the largest of these CHI values as $\theta_{chi}$;
Step 2: calculating the CHI value between features, selecting the feature pairs whose CHI value is greater than $\theta_{chi}$, picking out from each pair the feature with the smaller IG value, and arranging the picked features in descending order of CHI value to form the redundant feature set, the number of redundant features picked being recorded;
Step 3: setting the loop counter $m = 0$;
Step 4: incrementing the loop counter by 1, i.e. $m = m + 1$;
Step 5: removing one redundant feature from $F_{v2}$ according to the order of the redundant features in the redundant feature set, obtaining a feature subset and the number of features it contains; training on the feature matrix corresponding to this feature subset using a machine learning classification algorithm, obtaining the corresponding detection rate;
Step 6: comparing the value of $m$ with the corresponding count; if the return condition is met, returning to Step 4; otherwise performing the next step;
Step 7: comparing all the detection rates and recording the largest as the maximum detection rate; the feature subset corresponding to the maximum detection rate is denoted $F_{v3}$, with $M_{v3}$ selected features; the feature subset $F_{v3}$ is the de-redundant feature subset.
8. The method according to claim 7, characterized in that step five is specifically:
comparing the detection rate obtained in step four with the maximum detection rate $TP_{max}$ and assigning the larger of the two to $TP_{max}$; comparing $TP_{max}$ with the initially set detection rate threshold of 0.95: if $TP_{max} < 0.95$, returning to step three; if $TP_{max} \geq 0.95$, the feature subset corresponding to $TP_{max}$ is the optimal feature subset, denoted $F_v$.
9. A malware detection device, characterized in that the device comprises:
a feature extraction module, configured to extract the feature information of the software in a sample set and abstract the feature information into numerical form, obtaining a sample-set feature set and a sample-set feature matrix;
a subset generation module, configured to filter out the invalid features in the sample-set feature set using a feature selection algorithm, obtaining an optimal feature subset;
a detection model generation module, configured to train on the feature matrix corresponding to the optimal feature subset using a machine learning classification algorithm, generating a detection model.
10. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1 to 8 when executing the program.
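For readers who want to see the information gain and χ² statistic named in claim 3 in concrete form, the following sketch computes both for 0/1 feature columns. Several points are assumptions: features and labels are taken to be binary lists, and each deviation term $\xi$ is taken to be the standard Pearson term (observed − expected)² / expected, which the claim text does not spell out. The function names are hypothetical.

```python
from math import log2


def entropy(labels):
    """Shannon entropy H(Y) of a list of class labels, in bits."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)


def information_gain(feature, labels):
    """IG(f) = H(Y) - H(Y | f) for one feature column."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def chi_square(a, b):
    """CHI(a, b) as the sum of four cell terms (Pearson form assumed)."""
    n = len(a)
    chi = 0.0
    for va in (1, 0):
        for vb in (1, 0):
            observed = sum(1 for x, y in zip(a, b) if x == va and y == vb)
            expected = a.count(va) * b.count(vb) / n  # theoretical value
            if expected > 0:
                chi += (observed - expected) ** 2 / expected
    return chi


labels = [1, 1, 0, 0]  # 1 = malware, 0 = benign (illustrative)
print(information_gain([1, 1, 0, 0], labels), chi_square([1, 1, 0, 0], labels))
print(information_gain([1, 0, 1, 0], labels), chi_square([1, 0, 1, 0], labels))
```

A feature that perfectly separates the classes yields the maximum information gain and a large CHI value against the label column, while a feature independent of the labels yields zero for both; the same `chi_square` function can be applied to two feature columns to measure redundancy between them.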
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811495637.7A CN109784046B (en) | 2018-12-07 | 2018-12-07 | Malicious software detection method and device and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109784046A true CN109784046A (en) | 2019-05-21 |
CN109784046B CN109784046B (en) | 2021-02-02 |
Family
ID=66495778
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811495637.7A Active CN109784046B (en) | 2018-12-07 | 2018-12-07 | Malicious software detection method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109784046B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2128798A1 (en) * | 2008-05-27 | 2009-12-02 | Deutsche Telekom AG | Unknown malcode detection using classifiers with optimal training sets |
CN104298715A (en) * | 2014-09-16 | 2015-01-21 | 北京航空航天大学 | TF-IDF based multiple-index result merging and sequencing method |
CN105320887A (en) * | 2015-10-12 | 2016-02-10 | 湖南大学 | Static characteristic extraction and selection based detection method for Android malicious application |
CN107577942A (en) * | 2017-08-22 | 2018-01-12 | 中国民航大学 | A kind of composite character screening technique for Android malware detection |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110464345A (en) * | 2019-08-22 | 2019-11-19 | 北京航空航天大学 | A kind of separate head bioelectrical power signal interference elimination method and system |
CN110990834A (en) * | 2019-11-19 | 2020-04-10 | 重庆邮电大学 | Static detection method, system and medium for android malicious software |
CN110990834B (en) * | 2019-11-19 | 2022-12-27 | 重庆邮电大学 | Static detection method, system and medium for android malicious software |
CN110955895A (en) * | 2019-11-29 | 2020-04-03 | 珠海豹趣科技有限公司 | Operation interception method and device and computer readable storage medium |
CN110955895B (en) * | 2019-11-29 | 2022-03-29 | 珠海豹趣科技有限公司 | Operation interception method and device and computer readable storage medium |
CN112632539A (en) * | 2020-12-28 | 2021-04-09 | 西北工业大学 | Dynamic and static mixed feature extraction method in Android system malicious software detection |
CN112632539B (en) * | 2020-12-28 | 2024-04-09 | 西北工业大学 | Dynamic and static hybrid feature extraction method in Android system malicious software detection |
Also Published As
Publication number | Publication date |
---|---|
CN109784046B (en) | 2021-02-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109784046A (en) | A kind of malware detection method, apparatus and electronic equipment | |
US9298912B2 (en) | System and method for distinguishing human swipe input sequence behavior and using a confidence value on a score to detect fraudsters | |
Gou et al. | Improved pseudo nearest neighbor classification | |
Mandhare et al. | A comparative study of cluster based outlier detection, distance based outlier detection and density based outlier detection techniques | |
CN112733749A (en) | Real-time pedestrian detection method integrating attention mechanism | |
CN109145765B (en) | Face detection method and device, computer equipment and storage medium | |
CN109886284B (en) | Fraud detection method and system based on hierarchical clustering | |
Yang et al. | A deep multiscale pyramid network enhanced with spatial–spectral residual attention for hyperspectral image change detection | |
CN108022146A (en) | Characteristic item processing method, device, the computer equipment of collage-credit data | |
CN110084609B (en) | Transaction fraud behavior deep detection method based on characterization learning | |
CN103473556A (en) | Hierarchical support vector machine classifying method based on rejection subspace | |
CN112437053B (en) | Intrusion detection method and device | |
CN111062036A (en) | Malicious software identification model construction method, malicious software identification medium and malicious software identification equipment | |
Halim et al. | Recurrent neural network for malware detection | |
CN103164710A (en) | Selection integrated face identifying method based on compressed sensing | |
CN102324007A (en) | Method for detecting abnormality based on data mining | |
CN111753299A (en) | Unbalanced malicious software detection method based on packet integration | |
Tebyanian et al. | SC-COTD: Hardware trojan detection based on sequential/combinational testability features using ensemble classifier | |
Shukla et al. | A unique approach for detection of fake news using machine learning | |
CN116260565A (en) | Chip electromagnetic side channel analysis method, system and storage medium | |
Cui et al. | Hardware trojan detection based on cluster analysis of mahalanobis distance | |
CN110390215A (en) | A kind of hardware Trojan horse detection method and system based on raising activation probability | |
Shrivastava et al. | A SVM and K-means clustering based fast and efficient intrusion detection system | |
Li et al. | A novel fingerprint indexing approach focusing on minutia location and direction | |
CN107898458A (en) | Single examination time brain electricity P300 component detection methods and device based on image prior |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||