CN109784046B - Malicious software detection method and device and electronic equipment - Google Patents
- Publication number
- CN109784046B (application CN201811495637.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Complex Calculations (AREA)
Abstract
The invention discloses a malicious software detection method, a malicious software detection device and electronic equipment, relates to the field of security protection of mobile terminals, can effectively detect malicious software, and solves the problems of redundancy, irrelevance and noise existing in the extraction of malicious software features in the prior art. The malware detection method comprises the following steps: extracting characteristics; generating a subset; and generating a detection model. The malware detection apparatus includes: the device comprises a feature extraction module, a subset generation module and a detection model generation module. The electronic equipment comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, and the processor realizes the malicious software detection method when executing the program.
Description
Technical Field
The present invention relates to security protection of a mobile terminal, and in particular, to a method and an apparatus for detecting malicious software, and an electronic device.
Background
A mobile intelligent terminal is any mobile terminal that has network access capability and runs an operating system and application programs. Wide adoption and powerful functions have made mobile intelligent terminals, including those running the Android system, an irreplaceable tool in many fields of modern society. At the same time, malicious software targeting these terminals has become increasingly rampant. Such software damages the user's system without being noticed, steals user data and incurs charges, and seriously threatens the user's privacy and property. More seriously, it threatens confidential national information in areas such as the economy, politics and military affairs, and thus endangers national security. To cope with the growing threat of malicious software on mobile intelligent terminals and to meet the future need to detect unknown malicious software, a detection method for Android malicious software is needed.
Existing detection methods based on machine learning take an artificial intelligence perspective: a classification algorithm learns the features of known malicious software, and a continuously evolving, generalizing detection model is constructed to detect Android software automatically. The key to such methods is feature selection. The more effectively the selected features distinguish malicious software from normal software, the more efficient the detection model obtained by the machine learning classification algorithm, and the better its detection effect on malicious software. However, the malware features extracted by existing methods suffer from redundancy, irrelevance and noise: redundant features reduce the computational efficiency of the classification algorithm and the effectiveness of the detection model; irrelevant features mean that more training samples are needed to obtain a suitable detection model; and noisy features can directly lead to a wrong detection model. These problems greatly increase the time and space consumed by machine learning, to the point where the cost of analyzing and processing the features can make the classification algorithm unusable.
Disclosure of Invention
In view of this, the present invention provides a detection method, an apparatus and an electronic device, which can meet the detection requirement of a mobile terminal on unknown malware and solve the problems of redundancy, irrelevance and noise faced by feature selection.
Based on the above purpose, the present invention provides a malware detection method. The malware detection method comprises the following steps:
extracting feature information of sample set software, abstracting the feature information into a digital form, and obtaining a sample set feature set and a sample set feature matrix;
filtering invalid features in the feature set by using a feature selection algorithm to obtain an optimal feature subset;
and training the feature matrix corresponding to the optimal feature subset by adopting a machine learning classification algorithm to generate a detection model.
Optionally, the extracting feature information of the sample set software, and abstracting the feature information into a digital form to obtain a sample set feature set and a sample set feature matrix includes:
processing the sample set software installation package to obtain a global configuration file containing authority information and a decompiled file containing API information;
extracting corresponding authority and API characteristic information from the global configuration file and the decompiled file;
and vectorizing and abstracting the extracted authority and API characteristic information into a digital form to obtain a sample set characteristic set and a sample set characteristic matrix.
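The extraction step above can be sketched as follows. This is a minimal illustration, assuming the global configuration file is a decoded, plaintext AndroidManifest.xml and the decompiled file is smali text; the function names `extract_permissions` and `extract_api_calls` and the regular expression are hypothetical and are not taken from the patent.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by android:* attributes in a decoded AndroidManifest.xml.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Assumed shape of a smali call site, e.g.
#   invoke-virtual {v0, v1}, Landroid/telephony/SmsManager;->sendTextMessage(...)V
INVOKE_RE = re.compile(r"invoke-\S+\s+\{[^}]*\},\s*(L[^;]+;->[^\s(]+)")


def extract_permissions(manifest_xml: str) -> set:
    """Return the permission names declared by <uses-permission> elements."""
    root = ET.fromstring(manifest_xml)
    return {
        elem.attrib[ANDROID_NS + "name"]
        for elem in root.iter("uses-permission")
        if ANDROID_NS + "name" in elem.attrib
    }


def extract_api_calls(smali_text: str) -> set:
    """Return 'Lclass;->method' identifiers of the methods invoked in smali text."""
    return set(INVOKE_RE.findall(smali_text))
```

Returning sets keeps each feature at most once per sample, which matches a presence/absence encoding of the features.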
Optionally, the filtering invalid features in the sample set feature set by using a feature selection algorithm to obtain an optimal feature subset includes:
step one: initializing the sample set feature set, the related parameter constants of the sample set feature matrix, and the related parameters used in the subset generation process;
step two: calculating the feature frequency of each feature in the sample set feature set according to a feature frequency calculation formula, and filtering out irrelevant features through calculation and comparison to obtain a decorrelated feature subset;
the characteristic frequency calculation formula is as follows:
wherein $TF(f_j)$ represents the feature frequency of feature $f_j$; $N_{benign}$ represents the number of normal samples in the normal software set, together with the number of those samples in which feature $f_j$ appears; $N_{malware}$ represents the number of malicious samples in the malicious sample set, together with the number of those samples in which feature $f_j$ appears;
step three: calculating the information gain of each feature in the decorrelation feature subset according to an information gain calculation formula, and obtaining a denoising feature subset through calculation, comparison and screening;
the information gain calculation formula is as follows:
$IG(f_j) = H(Y) - H(Y \mid f_j)$
wherein $IG(f_j)$ represents the information gain of feature $f_j$ for the classification system, $H(Y)$ represents the entropy of the classification system, and $H(Y \mid f_j)$ represents the conditional entropy of the classification system;
step four: according to the $\chi^2$ statistic calculation formula, calculating the CHI value ($\chi^2$ statistic) between each feature in the denoised feature subset and the corresponding feature matrix, and the CHI value between features, and obtaining a redundancy-removed feature subset through calculation, comparison and screening;
the $\chi^2$ statistic calculation formula is as follows:
$CHI(f_i, f_j) = \xi_{11} + \xi_{12} + \xi_{21} + \xi_{22}$
wherein $CHI(f_i, f_j)$ represents the $\chi^2$ statistic of features $f_i$ and $f_j$; $\xi_{11}$ represents the deviation between the theoretical value and the actual value of the number of samples in which $f_i$ and $f_j$ appear simultaneously; $\xi_{12}$ represents the deviation for samples in which $f_i$ appears and $f_j$ does not; $\xi_{21}$ represents the deviation for samples in which $f_i$ does not appear and $f_j$ does; $\xi_{22}$ represents the deviation for samples in which neither $f_i$ nor $f_j$ appears;
step five: and analyzing and judging the redundancy-removing feature subset, and performing subset optimization according to a judgment result to obtain an optimal feature subset.
Optionally, the step one specifically includes:
denoting the sample set feature set as $F_v$, the sample set feature matrix as $X_{train}$, and the number of selected features as $M_v$; setting the initial information gain threshold to a specific value $\theta_{ig}$, the information gain step size to $\lambda$, the initial value of the information gain cycle step count $n$ to 0, and the detection rate threshold to 0.95; training on the sample set feature matrix $X_{train}$ using a machine learning classification algorithm, and recording the maximum detection rate as $TP_{max}$.
Optionally, the second step includes:
Step 1: calculating the feature frequency of each feature in the sample set feature set;
Step 2: filtering out the features whose feature frequency is 0, the remaining features forming an intermediate feature subset;
Step 3: training the feature matrix corresponding to the intermediate feature subset by a machine learning classification algorithm to obtain the corresponding detection rate $TP_{tf}$;
Step 4: filtering out the feature with the smallest feature frequency in the intermediate feature subset, the remaining features forming a reduced feature subset; training the feature matrix corresponding to the reduced feature subset by a machine learning classification algorithm to obtain the corresponding detection rate $TP_{tf}'$;
Step 5: comparing $TP_{tf}$ and $TP_{tf}'$: if $TP_{tf} = TP_{tf}'$, taking the reduced feature subset as the new intermediate feature subset and returning to Step 3; if $TP_{tf} \neq TP_{tf}'$, outputting the intermediate feature subset;
Step 6: denoting the output intermediate feature subset as feature subset $F_{v1}$, with $M_{v1}$ selected features; the feature subset $F_{v1}$ is the decorrelated feature subset.
Optionally, the third step includes:
Step 1: calculating the information gain of each feature in the decorrelated feature subset;
Step 2: incrementing the cycle step count by 1, i.e. $n = n + 1$;
Step 3: selecting the features satisfying $IG > \theta_{ig} - (n-1)\lambda$ to form a first feature subset and recording the number of selected features; selecting the features satisfying $IG > \theta_{ig} - n\lambda$ to form a second feature subset and recording the number of selected features;
Step 4: comparing the two recorded feature counts and, depending on the result, either returning to Step 2 or outputting a feature subset;
Step 5: denoting the output feature subset as feature subset $F_{v2}$, with $M_{v2}$ selected features; the feature subset $F_{v2}$ is the denoised feature subset.
Optionally, the fourth step includes:
Step 1: calculating the CHI value between each feature in the denoised feature subset $F_{v2}$ and the corresponding feature matrix, and recording the largest CHI value as $\theta_{chi}$;
Step 2: calculating the CHI values between features; for each pair of features whose CHI value is greater than $\theta_{chi}$, selecting the feature with the smaller IG value; arranging the selected features in descending order of CHI value to form a redundant feature set, and recording the number of redundant features selected;
Step 3: setting the cycle step count $m$ to 0;
Step 4: incrementing the cycle step count by 1, i.e. $m = m + 1$;
Step 5: removing redundant features from $F_{v2}$ in the order in which they are arranged in the redundant feature set to obtain a feature subset, and recording its number of selected features; training the feature matrix corresponding to this feature subset by a machine learning classification algorithm to obtain the corresponding detection rate;
Step 7: comparing all the detection rates and recording the maximum detection rate; the feature subset corresponding to the maximum detection rate is denoted as $F_{v3}$, with $M_{v3}$ selected features; the feature subset $F_{v3}$ is the redundancy-removed feature subset.
Optionally, the step five specifically includes:
comparing the detection rate obtained in step four with the maximum detection rate $TP_{max}$ and assigning the larger value to $TP_{max}$; comparing $TP_{max}$ with the initially set detection rate threshold of 0.95: if $TP_{max} < 0.95$, returning to step three; if $TP_{max} \geq 0.95$, the feature subset corresponding to $TP_{max}$ is the optimal feature subset, denoted as $F_v$.
Optionally, the method for generating a detection model specifically includes:
setting a detection rate threshold, training a feature matrix corresponding to the optimal feature subset by respectively utilizing a Bayesian algorithm, a support vector machine algorithm, a decision tree algorithm and a nearest neighbor classification algorithm, and selecting the optimal detection model to output according to the set detection rate threshold.
The method for selecting the optimal detection model according to the set detection rate threshold specifically comprises the following steps:
if the detection rate of the detection model obtained through training is not less than the threshold value, outputting the corresponding detection model;
if the detection rate of the detection model obtained through training is lower than the threshold value, changing the combination mode of the characteristics, retraining to obtain a new detection model until the threshold value requirement is met, and outputting the detection model meeting the threshold value requirement;
and if all possible feature combination modes are traversed and the detection rate of the obtained detection model still fails to meet the threshold requirement, outputting the detection model with the highest detection rate in the traversal process.
The invention also provides a malicious software detection device, which comprises:
a feature extraction module, configured to extract feature information of the sample set software and abstract the feature information into a digital form to obtain the sample set feature set and the sample set feature matrix;
a subset generation module, configured to filter invalid features in the sample set feature set by using a feature selection algorithm to obtain an optimal feature subset;
a detection model generation module, configured to train the feature matrix corresponding to the optimal feature subset by adopting a machine learning classification algorithm to generate a detection model.
The invention also provides electronic equipment for detecting the malicious software, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the program to realize the malicious software detection method provided by the invention.
From the above, it can be seen that the malware detection method, the malware detection device and the electronic device provided by the invention can effectively detect malware by extracting the permission and sensitive API features of malware and normal software, selecting the optimal features from the extracted features with a feature selection algorithm, and training on the selected combined permission and sensitive API features with a machine learning classification algorithm. The feature selection algorithm is designed on the basis of feature frequency, information gain and $\chi^2$ statistics: the feature frequency method filters out features irrelevant to classification, the information gain method selects features that have a large influence on classification, and the $\chi^2$ statistical method eliminates highly redundant features. The malware detection method provided by the invention can therefore overcome the problems of redundancy, irrelevance and noise in the malware features extracted by the prior art. The feature selection algorithm combines the three methods of feature frequency, information gain and $\chi^2$ statistics in a preferred specific order, which gives a better selection effect than combining the three methods arbitrarily or using only one or two of them. Training a machine learning classification algorithm on the feature subset obtained by the feature selection algorithm yields a detection model with higher efficiency and a better detection effect on malware.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram illustrating a malware detection method according to an embodiment of the present invention;
FIG. 2 is a block flow diagram of a malware detection method in an embodiment of the invention;
FIG. 3 is a flow chart of a feature extraction method in the malware detection method in the embodiment of the present invention;
FIG. 4 is a diagram illustrating a subset generation method in the malware detection method according to an embodiment of the present invention;
FIG. 5 is a flow chart of a method for decorrelating feature frequencies in a malware detection method in an embodiment of the present invention;
FIG. 6 is a flowchart illustrating an information gain denoising method in a malware detection method according to an embodiment of the present invention;
FIG. 7 is a flowchart of the $\chi^2$ statistical redundancy removal method in the malware detection method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram illustrating a method for generating a detection model in a malware detection method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
In one aspect of the invention, a malware detection method is provided.
As shown in fig. 1 and 2, in some embodiments of a malware detection method provided by the present invention, the malware detection method specifically includes:
S101: feature extraction. Decompiling the software samples by reverse engineering, extracting feature information, abstracting the feature information into a sample set feature set and a sample set feature matrix in a digital form that is easy to analyze, and storing them in a database;
S102: subset generation. Filtering invalid features in the sample set feature set by using a feature selection algorithm to obtain an optimal feature subset;
S103: detection model generation. Training the feature matrix corresponding to the optimal feature subset by adopting a machine learning classification algorithm to generate a dual-feature detection model.
As shown in fig. 3, in another embodiment of a malware detection method provided by the present invention, the feature extraction method specifically includes:
S301: processing the software APK package, decoding the AndroidManifest.xml file containing authority (permission) information into a global configuration file in plaintext format, and decompiling a classes.
S302: extracting corresponding authority and API characteristic information from the global configuration file and the smali file;
S303: vectorizing and abstracting the extracted feature semantic information into a digital form to obtain a sample set feature set and a sample set feature matrix. Specifically, in the sample set feature matrix, 1 indicates that a feature appears in a sample and 0 indicates that it does not appear, so that a binary sample set feature matrix describes the features of the sample set, wherein the rows represent sample vectors and the columns represent feature vectors.
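The vectorization in S303 can be illustrated with a short sketch. It assumes each sample has already been reduced to a set of feature names (authority and API identifiers), and that 1 marks presence and 0 absence; the function name `build_feature_matrix` is hypothetical.

```python
def build_feature_matrix(samples):
    """Build a binary sample-by-feature matrix.

    samples: list of sets, one set of feature names per sample.
    Returns (features, matrix): the sorted feature list (the feature set)
    and a list of rows, one row per sample, one column per feature.
    """
    features = sorted(set().union(*samples)) if samples else []
    column_of = {name: col for col, name in enumerate(features)}
    matrix = []
    for sample in samples:
        row = [0] * len(features)
        for name in sample:
            row[column_of[name]] = 1  # 1: the feature appears in this sample
        matrix.append(row)
    return features, matrix


features, X = build_feature_matrix([{"SEND_SMS", "INTERNET"}, {"INTERNET"}])
# features is ["INTERNET", "SEND_SMS"]; X is [[1, 1], [1, 0]]
```

Sorting the feature names fixes the column order, so the same feature list can be reused to index columns during feature selection.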
As shown in fig. 4, in another embodiment of a malware detection method provided by the present invention, the subset generation method specifically includes:
s401, step one: constant initialization: and initializing and setting the sample set feature set, the related parameter constants in the sample set feature matrix and the related parameters used in the subset generation process.
S402, step two: feature frequency decorrelation: calculating the feature frequency of each feature in the sample set feature set according to the feature frequency calculation formula, and filtering out irrelevant features through calculation and comparison to obtain the decorrelated feature subset;
the characteristic frequency calculation formula is as follows:
wherein $TF(f_j)$ represents the feature frequency of feature $f_j$; $N_{benign}$ represents the number of normal samples in the normal software set, together with the number of those samples in which feature $f_j$ appears; $N_{malware}$ represents the number of malicious samples in the malicious sample set, together with the number of those samples in which feature $f_j$ appears.
S403, step three: information gain denoising: calculating the information gain of each feature in the decorrelated feature subset according to the information gain calculation formula, and obtaining the denoised feature subset through calculation, comparison and screening;
the information gain calculation formula is as follows:
$IG(f_j) = H(Y) - H(Y \mid f_j)$
wherein $IG(f_j)$ represents the information gain of feature $f_j$ for the classification system, $H(Y)$ represents the entropy of the classification system, and $H(Y \mid f_j)$ represents the conditional entropy of the classification system.
The information gain calculation formula is specifically explained as follows:
let the probability of occurrence of a normal software sample be $P(c_0)$ and the probability of occurrence of a malware sample be $P(c_1)$; then the entropy of the classification system is defined as:
$H(Y) = -\sum_{i=0}^{1} P(c_i) \log P(c_i)$
given the conditional probability $P(c_i \mid f_j = 1)$ of each class when feature $f_j$ appears, the conditional entropy of the classification system is defined as:
$H(Y \mid f_j = 1) = -\sum_{i=0}^{1} P(c_i \mid f_j = 1) \log P(c_i \mid f_j = 1)$
then, when feature $f_j$ does not appear, the entropy of the classification system is defined as:
$H(Y \mid f_j = 0) = -\sum_{i=0}^{1} P(c_i \mid f_j = 0) \log P(c_i \mid f_j = 0)$
wherein the probability $P(c_i)$ is the proportion of class $c_i$ samples in the total number of training samples; the probability $P(f_j = 1)$ is the proportion of samples in which feature $f_j$ appears in the total number of samples, and the probability $P(f_j = 0)$ is the proportion of samples in which feature $f_j$ does not appear in the total number of samples.
Thus, the information gain $IG(f_j)$ of feature $f_j$ for the classification system is calculated as follows:
$IG(f_j) = H(Y) - P(f_j = 1)\, H(Y \mid f_j = 1) - P(f_j = 0)\, H(Y \mid f_j = 0)$
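As an illustration of the information gain defined above, the following sketch computes it for one binary feature column against binary class labels (0 for normal, 1 for malicious). The base-2 logarithm and the function names are assumptions made for the example; the text does not state the logarithm base.

```python
import math


def entropy(probabilities):
    """Shannon entropy of a discrete distribution (base 2 assumed)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def class_entropy(labels):
    """Entropy of binary class labels; 0.0 for an empty list."""
    if not labels:
        return 0.0
    p_malware = sum(labels) / len(labels)
    return entropy([p_malware, 1 - p_malware])


def information_gain(feature_column, labels):
    """IG(f) = H(Y) - P(f=1) H(Y|f=1) - P(f=0) H(Y|f=0)."""
    n = len(labels)
    present = [y for x, y in zip(feature_column, labels) if x == 1]
    absent = [y for x, y in zip(feature_column, labels) if x == 0]
    conditional = (len(present) / n) * class_entropy(present) \
        + (len(absent) / n) * class_entropy(absent)
    return class_entropy(labels) - conditional
```

A feature that splits the two classes perfectly yields a gain equal to the class entropy, while a feature distributed identically in both classes yields a gain of 0.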
S404, step four: $\chi^2$ statistical redundancy removal: according to the $\chi^2$ statistic calculation formula, calculating the CHI value ($\chi^2$ statistic) between each feature in the denoised feature subset and the corresponding feature matrix, and the CHI value between features, and obtaining the redundancy-removed feature subset through calculation, comparison and screening;
the $\chi^2$ statistic of two features $f_i$ and $f_j$ is calculated as follows:
$CHI(f_i, f_j) = \xi_{11} + \xi_{12} + \xi_{21} + \xi_{22}$
wherein $CHI(f_i, f_j)$ represents the $\chi^2$ statistic of features $f_i$ and $f_j$; $\xi_{11}$ represents the deviation between the theoretical value and the actual value of the number of samples in which $f_i$ and $f_j$ appear simultaneously; $\xi_{12}$ represents the deviation for samples in which $f_i$ appears and $f_j$ does not; $\xi_{21}$ represents the deviation for samples in which $f_i$ does not appear and $f_j$ does; $\xi_{22}$ represents the deviation for samples in which neither $f_i$ nor $f_j$ appears.
The $\chi^2$ statistic calculation formula is explained as follows:
the $\chi^2$ statistic measures the degree of correlation between features and categories based on the deviation between actual values and theoretical values. Suppose that, for two features $f_i$ and $f_j$, the number of samples in which both features appear is $N_{11}$, the number of samples in which neither appears is $N_{22}$, the number of samples in which $f_i$ appears but $f_j$ does not is $N_{12}$, and the number of samples in which $f_i$ does not appear but $f_j$ does is $N_{21}$. The relationship between them is shown in Table 1:
TABLE 1 Feature distribution table
                      f_j appears   f_j does not appear
f_i appears           N_11          N_12
f_i does not appear   N_21          N_22
where $N$ is the total number of samples, the sum of the four cases, i.e. $N = N_{11} + N_{12} + N_{21} + N_{22}$. Thus, the frequency with which feature $f_i$ appears is:
$P(f_i) = \frac{N_{11} + N_{12}}{N}$
the number of samples in which feature $f_j$ appears is $N_{11} + N_{21}$, so theoretically, among the samples in which feature $f_j$ appears, the number of samples in which feature $f_i$ also appears is:
$E_{11} = (N_{11} + N_{21}) \cdot \frac{N_{11} + N_{12}}{N}$
then the deviation $\xi_{11}$ between the theoretical value and the actual value for $f_i$ and $f_j$ appearing simultaneously is:
$\xi_{11} = \frac{(N_{11} - E_{11})^2}{E_{11}}$
in the same way, one obtains the theoretical number of samples $E_{12}$ in which $f_i$ appears and $f_j$ does not, the theoretical number $E_{21}$ in which $f_i$ does not appear but $f_j$ does, and the theoretical number $E_{22}$ in which neither appears, together with the deviations $\xi_{12}$, $\xi_{21}$, $\xi_{22}$ between their theoretical and actual values:
$E_{12} = (N_{12} + N_{22}) \cdot \frac{N_{11} + N_{12}}{N}, \quad E_{21} = (N_{11} + N_{21}) \cdot \frac{N_{21} + N_{22}}{N}, \quad E_{22} = (N_{12} + N_{22}) \cdot \frac{N_{21} + N_{22}}{N}$
$\xi_{12} = \frac{(N_{12} - E_{12})^2}{E_{12}}, \quad \xi_{21} = \frac{(N_{21} - E_{21})^2}{E_{21}}, \quad \xi_{22} = \frac{(N_{22} - E_{22})^2}{E_{22}}$
thus, the $\chi^2$ statistic of the two features $f_i$ and $f_j$ is the sum of the deviations $\xi_{11}$, $\xi_{12}$, $\xi_{21}$, $\xi_{22}$, i.e.
$CHI(f_i, f_j) = \xi_{11} + \xi_{12} + \xi_{21} + \xi_{22}$
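The CHI value between two binary feature columns can be sketched as below. The example assumes the standard Pearson form, in which each deviation is the squared difference between the actual and theoretical counts divided by the theoretical count; that form and the function name `chi_square` are assumptions for illustration. The same function can be applied to a feature column and the class label column.

```python
def chi_square(col_i, col_j):
    """Chi-square statistic of two equal-length binary (0/1) columns."""
    n = len(col_i)
    # Actual counts for the four presence/absence combinations.
    observed = {(a, b): 0 for a in (1, 0) for b in (1, 0)}
    for a, b in zip(col_i, col_j):
        observed[(a, b)] += 1
    # Marginal totals: how often each column takes the value 1 or 0.
    total_i = {a: observed[(a, 1)] + observed[(a, 0)] for a in (1, 0)}
    total_j = {b: observed[(1, b)] + observed[(0, b)] for b in (1, 0)}
    chi = 0.0
    for (a, b), actual in observed.items():
        expected = total_i[a] * total_j[b] / n  # theoretical count
        if expected > 0:  # skip empty cells to avoid division by zero
            chi += (actual - expected) ** 2 / expected
    return chi
```

Two identical columns give the largest possible value for their size (equal to the number of samples when both values occur), while two independent columns give 0, which is why a large CHI value between two features signals redundancy.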
S405, step five: generating an optimal feature subset: and analyzing and judging the redundancy-removed feature subset, and performing further operation according to a judgment result to obtain a final optimal feature subset.
Wherein, the first step is specifically as follows:
denoting the sample set feature matrix as $X_{train}$, the selected feature set as $F_v$, and the number of selected features as $M_v$. Setting the initial information gain threshold to a specific value $\theta_{ig}$, the information gain step size to $\lambda$, the information gain cycle step count $n$ to 0, and the detection rate threshold to 0.95. Training on the original feature matrix $X_{train}$ using a machine learning classification algorithm and recording the maximum detection rate as $TP_{max}$.
As shown in fig. 5, the second step specifically includes:
S501: calculating the feature frequency of each feature in the sample set feature set;
S502: filtering out the features whose feature frequency is 0, the remaining features forming an intermediate feature subset;
S503: training the feature matrix corresponding to the intermediate feature subset by a machine learning classification algorithm to obtain the corresponding detection rate $TP_{tf}$;
S504: filtering out the feature with the smallest feature frequency in the intermediate feature subset, the remaining features forming a reduced feature subset; training the feature matrix corresponding to the reduced feature subset by a machine learning classification algorithm to obtain the corresponding detection rate $TP_{tf}'$;
S505: comparing $TP_{tf}$ and $TP_{tf}'$: if $TP_{tf} = TP_{tf}'$, taking the reduced feature subset as the new intermediate feature subset and returning to step S503; if $TP_{tf} \neq TP_{tf}'$, outputting the intermediate feature subset;
S506: denoting the output feature subset as $F_{v1}$, with $M_{v1}$ selected features; the feature subset $F_{v1}$ is the decorrelated feature subset.
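The loop of S501 to S506 can be sketched as follows. The feature frequencies are taken as precomputed input (`tf`), and `evaluate` is a hypothetical callback that trains a classifier on the given features and returns its detection rate; neither name comes from the patent. The guard that stops before the subset becomes empty is an added assumption.

```python
def decorrelate_by_frequency(tf, evaluate):
    """Drop low-frequency features while the detection rate is unchanged.

    tf: dict mapping feature name -> precomputed feature frequency.
    evaluate: callable(list of features) -> detection rate.
    """
    subset = [f for f, value in tf.items() if value != 0]    # S502
    tp = evaluate(subset)                                     # S503
    while len(subset) > 1:                                    # assumed guard
        weakest = min(subset, key=lambda f: tf[f])            # S504
        reduced = [f for f in subset if f != weakest]
        tp_reduced = evaluate(reduced)
        if tp_reduced != tp:                                  # S505: rate changed
            break
        subset = reduced                                      # back to S503
    return subset                                             # S506
```

The exact equality test mirrors the comparison in S505; with floating-point detection rates from real cross-validation, a small tolerance may be more practical.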
As shown in fig. 6, the third step specifically includes:
S601: calculating the information gain of each feature in the decorrelated feature subset;
S602: incrementing the cycle step count by 1, i.e. $n = n + 1$;
S603: selecting the features satisfying $IG > \theta_{ig} - (n-1)\lambda$ to form a first feature subset and recording the number of selected features; selecting the features satisfying $IG > \theta_{ig} - n\lambda$ to form a second feature subset and recording the number of selected features;
S605: denoting the output feature subset as $F_{v2}$, with $M_{v2}$ selected features; the feature subset $F_{v2}$ is the denoised feature subset.
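The threshold selection in S603 can be sketched as below, assuming the information gain of every feature is available as a dictionary. Only the two selections for a given step count n are shown, since the rule that decides when the loop stops is not given in the steps above; the function name `select_by_gain` is hypothetical.

```python
def select_by_gain(ig, theta_ig, lam, n):
    """Return the feature subsets for the thresholds of steps n-1 and n.

    ig: dict mapping feature name -> information gain.
    theta_ig: initial information gain threshold.
    lam: step size by which the threshold is lowered per cycle.
    n: current cycle step count (1, 2, ...).
    """
    previous = [f for f, gain in ig.items() if gain > theta_ig - (n - 1) * lam]
    current = [f for f, gain in ig.items() if gain > theta_ig - n * lam]
    return previous, current


previous, current = select_by_gain({"a": 0.5, "b": 0.3, "c": 0.05}, 0.4, 0.2, 1)
# previous is ["a"]; current is ["a", "b"]
```

Because the threshold only decreases as n grows, the subset for step n always contains the subset for step n-1.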
As shown in fig. 7, the fourth step specifically includes:
S701: calculating the CHI value ($\chi^2$ statistic) between each feature in the feature subset $F_{v2}$ and the corresponding feature matrix, and recording the largest CHI value as $\theta_{chi}$;
S702: calculating the CHI values between features; for each pair of features whose CHI value is greater than $\theta_{chi}$, selecting the feature with the smaller IG value; arranging the selected features in descending order of CHI value to form a redundant feature set, and recording the number of redundant features selected;
S703: setting the cycle step count $m$ to 0;
S704: incrementing the cycle step count by 1, i.e. $m = m + 1$;
S705: removing redundant features from $F_{v2}$ in the order in which they are arranged in the redundant feature set to obtain a feature subset, and recording its number of selected features; training the feature matrix corresponding to this feature subset by a machine learning classification algorithm to obtain the corresponding detection rate;
S707: comparing all the detection rates and recording the maximum detection rate; the feature subset corresponding to the maximum detection rate is denoted as $F_{v3}$, with $M_{v3}$ selected features; the feature subset $F_{v3}$ is the redundancy-removed feature subset.
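A possible reading of S701 to S707 is sketched below. It takes precomputed CHI and IG values as dictionaries and a hypothetical `evaluate` callback that returns a detection rate. Two points are assumptions of the sketch and not statements of the method: redundant features are removed cumulatively, one more per cycle, and the unreduced subset is included as a baseline when picking the maximum.

```python
def remove_redundancy(features, chi_with_label, chi_between, ig, evaluate):
    """Remove redundant features and keep the best-scoring subset.

    chi_with_label: dict feature -> CHI value against the class labels.
    chi_between: dict (feature_i, feature_j) -> CHI value between the pair.
    ig: dict feature -> information gain.
    evaluate: callable(list of features) -> detection rate.
    """
    theta_chi = max(chi_with_label[f] for f in features)         # S701
    redundant = []                                                # S702
    pairs = sorted(chi_between.items(), key=lambda kv: kv[1], reverse=True)
    for (fi, fj), chi in pairs:
        if chi > theta_chi:
            weaker = fi if ig[fi] < ig[fj] else fj   # smaller IG is dropped
            if weaker not in redundant:
                redundant.append(weaker)
    subset = list(features)
    best_subset, best_tp = subset, evaluate(subset)   # assumed baseline
    for feature in redundant:                                     # S703-S705
        subset = [f for f in subset if f != feature]
        tp = evaluate(subset)
        if tp > best_tp:                                          # S707
            best_subset, best_tp = subset, tp
    return best_subset, best_tp
```

Passing the evaluation as a callback keeps the sketch independent of any particular classifier.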
The method for generating the optimal feature subset in the fifth step specifically comprises the following steps:
comparing the detection rate obtained in step four with the maximum detection rate $TP_{max}$ and assigning the larger value to $TP_{max}$; comparing $TP_{max}$ with the initially set detection rate threshold of 0.95: if $TP_{max} < 0.95$, returning to step three; if $TP_{max} \geq 0.95$, the feature subset corresponding to $TP_{max}$ is the optimal feature subset, denoted as $F_v$.
As shown in fig. 8, in another embodiment of the malware detection method provided by the present invention, the method for generating the detection model specifically includes:
first, setting a detection rate threshold; then training on the permission and sensitive API (application programming interface) features with a naive Bayes algorithm (NB), a support vector machine algorithm (SVM), a decision tree algorithm (DT) and a nearest neighbor classification algorithm (KNN), respectively; and selecting the optimal detection model for output according to the set detection rate threshold.
As shown in fig. 8, in another embodiment of the malware detection method provided by the present invention, the method for selecting the optimal detection model for output according to the set detection rate threshold specifically includes:
if the detection rate of the detection model obtained through training is not less than the threshold value, outputting the corresponding detection model;
if the detection rate of the detection model obtained through training is lower than the threshold value, changing the combination mode of the characteristics, retraining to obtain a new detection model until the threshold value requirement is met, and outputting the detection model meeting the threshold value requirement;
and if all possible feature combination modes are traversed and the detection rate of the obtained detection model still fails to meet the threshold requirement, outputting the detection model with the highest detection rate in the traversal process.
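The three output rules above can be sketched as a single search. Everything named here is hypothetical: `select_model`, the `Trainer` callback type (which stands in for training NB, SVM, DT or KNN and measuring the detection rate), and the order in which feature combinations are tried.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A trainer takes a feature combination and returns (model, detection rate).
Trainer = Callable[[List[str]], Tuple[object, float]]


def select_model(
    trainers: Dict[str, Trainer],
    feature_combinations: Iterable[List[str]],
    threshold: float,
) -> Tuple[str, object, float]:
    """Return the first model meeting the threshold, else the best one seen."""
    best = None  # (rate, algorithm name, model)
    for features in feature_combinations:
        for name, train in trainers.items():
            model, rate = train(features)
            if rate >= threshold:
                return name, model, rate  # threshold met: output immediately
            if best is None or rate > best[0]:
                best = (rate, name, model)
    if best is None:
        raise ValueError("no trainers or feature combinations were supplied")
    rate, name, model = best
    return name, model, rate  # fallback: highest detection rate in the traversal
```

Returning early on the first model that meets the threshold mirrors the first rule; the fallback mirrors the last rule.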
Table 2 shows the detection performance test results for an embodiment of the malware detection method provided by the present invention.
TABLE 2 test performance results
The specific implementation is as follows:
5000 normal software samples collected from the Anzhi market and 5000 malware samples from VirusShare are used as the sample set, and testing is performed by 10-fold cross validation (10-fold cross validation divides the sample data into 10 mutually exclusive subsets of similar size; each time, the union of 9 subsets is used as the training set and the remaining subset as the test set; training and testing are carried out 10 times, and the average of the 10 test results is taken as the final result).
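The 10-fold procedure described above can be sketched in plain Python. The helper names `k_fold_indices` and `cross_validate`, the shuffling seed, and the `evaluate` callback (which would train on one index list and score on the other) are assumptions for illustration.

```python
import random
from typing import Callable, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Split sample indices into k mutually exclusive folds of similar size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        # The training set is the union of the other k - 1 folds.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def cross_validate(
    n_samples: int,
    evaluate: Callable[[List[int], List[int]], float],
    k: int = 10,
) -> float:
    """Average the score of `evaluate(train, test)` over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

With the 10000 samples mentioned in the text, each fold holds 1000 test samples and 9000 training samples.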
The detection performance obtained without a feature selection algorithm is compared with that obtained with the feature selection algorithm based on feature frequency, information gain and the χ² statistic.
The performance indicators are defined as follows:
(1) TPR (detection rate) is the ratio of the positive examples correctly classified by the classifier to the actual positive examples; the larger the TPR, the better the classifier performs on positive examples. The calculation formula is as follows:
$$\mathrm{TPR} = \frac{TP}{TP + FN}$$
(2) FPR (false alarm rate) is the ratio of the negative examples wrongly classified as positive by the classifier to the actual negative examples; the larger the FPR, the worse the classifier performs on negative examples. The calculation formula is as follows:
$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
(3) Acc (accuracy) is the ratio of all samples correctly classified by the classifier to the total number of samples, and represents how accurately the classifier classifies; the larger the Acc, the better the overall classification capability of the classifier. The calculation formula is as follows:
$$\mathrm{Acc} = \frac{TP + TN}{TP + FP + TN + FN}$$
In the formulas, TP (true positive) is the number of actually positive samples detected as positive, i.e. correctly detected positive examples; FP (false positive) is the number of actually negative samples detected as positive, i.e. wrongly detected negative examples; FN (false negative) is the number of actually positive samples detected as negative, i.e. wrongly classified positive examples; TN (true negative) is the number of actually negative samples detected as negative, i.e. correctly detected negative examples.
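The three indicators follow directly from the four confusion counts. The sketch below assumes the standard definitions TPR = TP/(TP+FN), FPR = FP/(FP+TN) and Acc = (TP+TN)/(TP+FP+TN+FN); the function name `detection_metrics` is hypothetical.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute TPR, FPR and Acc from confusion counts.

    Assumes at least one actual positive and one actual negative sample,
    so that no denominator is zero.
    """
    return {
        "TPR": tp / (tp + fn),                    # detection rate
        "FPR": fp / (fp + tn),                    # false alarm rate
        "Acc": (tp + tn) / (tp + fp + fn + tn),   # accuracy
    }
```

For example, with the purely illustrative counts TP = 90, FN = 10, FP = 5, TN = 95 (not taken from Table 2), the function gives TPR = 0.9, FPR = 0.05 and Acc = 0.925.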
According to the test results, after the feature selection algorithm is used to optimize the selection of software features, the detection models trained on those features with the four machine learning classification algorithms achieve a higher detection rate and accuracy, and a lower false alarm rate, than the detection models obtained by the conventional malware detection method that uses no feature selection algorithm.
The malicious software detection method provided by the invention has higher detection efficiency and better detection effect.
In another aspect of the invention, a malware detection apparatus is provided.
In some embodiments of a malware detection apparatus provided by the present invention, the apparatus comprises:
a feature extraction module, used for extracting feature information of the sample set software and abstracting the feature information into a digital form to obtain a sample set feature set and a sample set feature matrix;
a subset generation module, used for filtering invalid features in the sample set feature set by using a feature selection algorithm to obtain an optimal feature subset;
a detection model generation module, used for training the feature matrix corresponding to the optimal feature subset by adopting a machine learning classification algorithm to generate a detection model.
In another aspect of the invention, a malware detection electronic device is provided.
In some embodiments of the present invention, an electronic device for malware detection includes:
a memory, a processor, and a computer program stored on the memory and executable on the processor.
When the processor executes the program, the malicious software detection method provided by the invention is realized.
The apparatus and the electronic device of the foregoing embodiments are used to implement the corresponding method in the foregoing embodiments, and have the beneficial effects of the corresponding method embodiments, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, also features in the above embodiments or in different embodiments may be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.
In addition, well known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the invention. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
The embodiments of the invention are intended to embrace all such alternatives, modifications and variances that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements and the like that may be made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (7)
1. A malware detection method, comprising:
extracting feature information of sample set software, abstracting the feature information into a digital form, and obtaining a sample set feature set and a sample set feature matrix;
filtering invalid features in the feature set by using a feature selection algorithm to obtain an optimal feature subset;
training a feature matrix corresponding to the optimal feature subset by adopting a machine learning classification algorithm to generate a detection model;
wherein the filtering the invalid features in the sample set feature set by using the feature selection algorithm to obtain the optimal feature subset comprises:
step one: initializing the sample set feature set, the relevant parameter constants in the sample set feature matrix and the relevant parameters used in the subset generation process, including:
denoting the sample set feature set as F_v, the sample set feature matrix as X_train, and the number of selected features as M_v; setting the initial threshold of the information gain to a specific value θ_ig, setting the information gain step size to λ, setting the initial value of the information gain loop counter n to 0, and setting the detection rate threshold to 0.95; training the sample set feature matrix X_train using a machine learning classification algorithm, and recording the maximum detection rate as TP_max;
step two: calculating the feature frequency of each feature in the sample set feature set according to a feature frequency calculation formula, and filtering out irrelevant features through calculation and comparison to obtain a decorrelated feature subset;
the feature frequency calculation formula is expressed in terms of the following quantities:
TF(f_j) represents the feature frequency of feature f_j; N_benign represents the number of normal samples in the normal software set, together with the number of those normal samples in which feature f_j is present; N_malware represents the number of malicious samples in the malicious sample set, together with the number of those malicious samples in which feature f_j is present;
the calculating of the feature frequency of each feature in the sample set feature set according to the feature frequency calculation formula, and the filtering out of irrelevant features through calculation and comparison to obtain the decorrelated feature subset, comprises:
step 1: calculating the feature frequency of each feature in the sample set feature set;
step 2: filtering out the features whose feature frequency value is 0, the remaining features forming an intermediate feature subset;
step 3: training the feature matrix corresponding to the intermediate feature subset using a machine learning classification algorithm to obtain the corresponding detection rate TP_tf;
step 4: filtering out the feature with the smallest feature frequency in the intermediate feature subset, the remaining features forming a candidate feature subset; training the feature matrix corresponding to the candidate feature subset using a machine learning classification algorithm to obtain the corresponding detection rate TP_tf';
step 5: comparing the values of TP_tf and TP_tf'; if TP_tf = TP_tf', taking the candidate feature subset as the new intermediate feature subset and returning to step 3; if TP_tf ≠ TP_tf', outputting the intermediate feature subset;
step 6: denoting the output intermediate feature subset as the feature subset F_v1, the number of selected features being M_v1, the feature subset F_v1 being the decorrelated feature subset;
step three: calculating the information gain of each feature in the decorrelation feature subset according to an information gain calculation formula, and obtaining a denoising feature subset through calculation, comparison and screening;
the information gain calculation formula is as follows:
$$IG(f_j) = H(Y) - H(Y \mid f_j)$$
wherein IG(f_j) represents the information gain of feature f_j for the classification system, H(Y) represents the entropy of the classification system, and H(Y | f_j) represents the conditional entropy of the classification system;
step four: calculating, according to a χ² statistic calculation formula, the CHI value (χ² statistic) between each feature in the de-noised feature subset and the corresponding feature matrix, as well as the CHI values between features, and obtaining a de-redundant feature subset through calculation, comparison and screening;
the χ² statistic calculation formula is as follows:
$$\mathrm{CHI}(f_i, f_j) = \xi_{11} + \xi_{12} + \xi_{21} + \xi_{22}$$
wherein CHI(f_i, f_j) represents the χ² statistic of features f_i and f_j; ξ_11 represents the deviation between the theoretical value and the actual value of the number of samples in which f_i and f_j appear together; ξ_12 represents the deviation between the theoretical value and the actual value of the number of samples in which f_i appears and f_j does not; ξ_21 represents the deviation between the theoretical value and the actual value of the number of samples in which f_i does not appear and f_j does; ξ_22 represents the deviation between the theoretical value and the actual value of the number of samples in which neither f_i nor f_j appears;
step five: analyzing and evaluating the de-redundant feature subset, and optimizing the subset according to the evaluation result to obtain the optimal feature subset.
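The two scoring formulas recited in steps three and four can be sketched for binary (0/1) feature columns as follows. The function names are hypothetical, and the sketch assumes that each ξ term is the usual (observed - expected)² / expected contribution of one cell of the 2×2 contingency table, which the claim text does not spell out.

```python
import math
from collections import Counter
from typing import Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy H(Y) in bits of a non-empty label sequence."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature: Sequence[int], labels: Sequence[int]) -> float:
    """IG(f_j) = H(Y) - H(Y | f_j) for one feature column."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


def chi_square(fi: Sequence[int], fj: Sequence[int]) -> float:
    """Sum the four cell terms xi_11 + xi_12 + xi_21 + xi_22 for two 0/1 columns.

    Assumption: each xi is (observed - expected)^2 / expected.
    """
    n = len(fi)
    observed = Counter(zip(fi, fj))
    chi = 0.0
    for a in (1, 0):
        for b in (1, 0):
            row = observed[(a, 1)] + observed[(a, 0)]
            col = observed[(1, b)] + observed[(0, b)]
            expected = row * col / n
            if expected > 0:  # skip empty rows or columns to avoid dividing by zero
                chi += (observed[(a, b)] - expected) ** 2 / expected
    return chi
```

A feature identical to the label has the maximum information gain, and two identical columns of length n give a χ² value of n, which is why a large CHI value between two features is treated as a sign of redundancy.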
2. The method of claim 1, wherein the extracting feature information of the sample set software, and abstracting the feature information into a digital form to obtain a sample set feature set and a sample set feature matrix comprises:
processing the sample set software installation packages to obtain a global configuration file containing permission information and decompiled files containing API information;
extracting the corresponding permission and API feature information from the global configuration file and the decompiled files;
and vectorizing the extracted permission and API feature information, abstracting it into a digital form to obtain a sample set feature set and a sample set feature matrix.
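The vectorizing step can be illustrated with a binary presence matrix. This is a sketch under stated assumptions: the function name `build_feature_matrix` is hypothetical, each sample is assumed to be already reduced to a set of permission and API name strings, and a 0/1 encoding is assumed because the claim only says "digital form".

```python
from typing import List, Set, Tuple


def build_feature_matrix(samples: List[Set[str]]) -> Tuple[List[str], List[List[int]]]:
    """Return (feature set, feature matrix) with one row per sample.

    Entry [i][j] is 1 if sample i contains feature j, else 0.
    """
    feature_set = sorted(set().union(*samples))
    matrix = [[1 if f in s else 0 for f in feature_set] for s in samples]
    return feature_set, matrix
```

The sorted feature list fixes the column order, so the same column index refers to the same permission or API in every row.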
3. The method of claim 1, wherein step three comprises:
step 1: calculating an information gain for each feature in the decorrelated subset of features;
step 2: incrementing the loop counter by 1, i.e. n = n + 1;
step 3: selecting the features satisfying IG > (θ_ig - (n - 1)λ) to constitute a first feature subset, and recording the number of selected features; selecting the features satisfying IG > (θ_ig - nλ) to constitute a second feature subset, and recording the number of selected features;
step 4: comparing the two recorded numbers of selected features and, according to the result of the comparison, either returning to step 2 or outputting the feature subset.
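The threshold relaxation in step 3 can be sketched as below. Only the selection rule IG > θ_ig - nλ is taken from the claim; the function name `select_by_ig` and the dictionary input are assumptions, and the stopping condition of step 4 is deliberately not modelled.

```python
from typing import Dict, List


def select_by_ig(ig: Dict[str, float], theta_ig: float, lam: float, n: int) -> List[str]:
    """Select the features whose information gain exceeds theta_ig - n * lam."""
    threshold = theta_ig - n * lam
    return [feature for feature, value in ig.items() if value > threshold]
```

Calling the function with n - 1 and with n gives the two subsets that step 3 compares; each increment of n lowers the threshold by λ, so the second subset always contains the first.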
4. The method of claim 3, wherein the fourth step comprises:
step 1: calculating the CHI value between each feature in the de-noised feature subset F_v2 and the corresponding feature matrix, and recording the largest CHI value as θ_chi;
step 2: calculating the CHI values between features; selecting the feature pairs whose CHI value is larger than θ_chi and, from them, the features with the smaller IG value; and arranging the selected features in descending order of CHI value to form a redundant feature set, the number of redundant features selected being recorded;
step 3: setting the loop counter m to 0;
step 4: incrementing the loop counter by 1, i.e. m = m + 1;
step 5: removing redundant features from F_v2 according to their order in the redundant feature set to obtain a feature subset, and recording the number of selected features; training the feature matrix corresponding to the feature subset using a machine learning classification algorithm to obtain the corresponding detection rate.
5. The method according to claim 4, wherein the step five is specifically:
comparing the detection rate obtained in step four with the maximum detection rate TP_max, and assigning the larger value to TP_max; comparing TP_max with the initially set detection rate threshold of 0.95; if TP_max is less than 0.95, returning to step three; if TP_max is greater than or equal to 0.95, the feature subset corresponding to TP_max is the optimal feature subset, and the optimal feature subset is denoted as F_v.
6. A malware detection apparatus, comprising:
the feature extraction module is used for extracting the feature information of the sample set software, abstracting the feature information into a digital form and obtaining a sample set feature set and a sample set feature matrix;
the subset generation module is used for filtering invalid features in the feature set by using a feature selection algorithm to obtain an optimal feature subset;
the detection model generation module is used for training the feature matrix corresponding to the optimal feature subset by adopting a machine learning classification algorithm to generate a detection model;
the subset generation module filters invalid features in the sample set feature set by using a feature selection algorithm to obtain an optimal feature subset, and the method comprises the following steps:
step one: initializing the sample set feature set, the relevant parameter constants in the sample set feature matrix and the relevant parameters used in the subset generation process, including:
denoting the sample set feature set as F_v, the sample set feature matrix as X_train, and the number of selected features as M_v; setting the initial threshold of the information gain to a specific value θ_ig, setting the information gain step size to λ, setting the initial value of the information gain loop counter n to 0, and setting the detection rate threshold to 0.95; training the sample set feature matrix X_train using a machine learning classification algorithm, and recording the maximum detection rate as TP_max;
step two: calculating the feature frequency of each feature in the sample set feature set according to a feature frequency calculation formula, and filtering out irrelevant features through calculation and comparison to obtain a decorrelated feature subset;
the feature frequency calculation formula is expressed in terms of the following quantities:
TF(f_j) represents the feature frequency of feature f_j; N_benign represents the number of normal samples in the normal software set, together with the number of those normal samples in which feature f_j is present; N_malware represents the number of malicious samples in the malicious sample set, together with the number of those malicious samples in which feature f_j is present;
the subset generation module calculating the feature frequency of each feature in the sample set feature set according to the feature frequency calculation formula, and filtering out irrelevant features through calculation and comparison to obtain the decorrelated feature subset, comprises:
step 1: calculating the feature frequency of each feature in the sample set feature set;
step 2: filtering out the features whose feature frequency value is 0, the remaining features forming an intermediate feature subset;
step 3: training the feature matrix corresponding to the intermediate feature subset using a machine learning classification algorithm to obtain the corresponding detection rate TP_tf;
step 4: filtering out the feature with the smallest feature frequency in the intermediate feature subset, the remaining features forming a candidate feature subset; training the feature matrix corresponding to the candidate feature subset using a machine learning classification algorithm to obtain the corresponding detection rate TP_tf';
step 5: comparing the values of TP_tf and TP_tf'; if TP_tf = TP_tf', taking the candidate feature subset as the new intermediate feature subset and returning to step 3; if TP_tf ≠ TP_tf', outputting the intermediate feature subset;
step 6: denoting the output intermediate feature subset as the feature subset F_v1, the number of selected features being M_v1, the feature subset F_v1 being the decorrelated feature subset;
step three: calculating the information gain of each feature in the decorrelation feature subset according to an information gain calculation formula, and obtaining a denoising feature subset through calculation, comparison and screening;
the information gain calculation formula is as follows:
$$IG(f_j) = H(Y) - H(Y \mid f_j)$$
wherein IG(f_j) represents the information gain of feature f_j for the classification system, H(Y) represents the entropy of the classification system, and H(Y | f_j) represents the conditional entropy of the classification system;
step four: calculating, according to a χ² statistic calculation formula, the CHI value (χ² statistic) between each feature in the de-noised feature subset and the corresponding feature matrix, as well as the CHI values between features, and obtaining a de-redundant feature subset through calculation, comparison and screening;
the χ² statistic calculation formula is as follows:
$$\mathrm{CHI}(f_i, f_j) = \xi_{11} + \xi_{12} + \xi_{21} + \xi_{22}$$
wherein CHI(f_i, f_j) represents the χ² statistic of features f_i and f_j; ξ_11 represents the deviation between the theoretical value and the actual value of the number of samples in which f_i and f_j appear together; ξ_12 represents the deviation between the theoretical value and the actual value of the number of samples in which f_i appears and f_j does not; ξ_21 represents the deviation between the theoretical value and the actual value of the number of samples in which f_i does not appear and f_j does; ξ_22 represents the deviation between the theoretical value and the actual value of the number of samples in which neither f_i nor f_j appears;
step five: analyzing and evaluating the de-redundant feature subset, and optimizing the subset according to the evaluation result to obtain the optimal feature subset.
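The feature frequency filtering of step two (steps 1 to 6) can be sketched as a loop. Several points are assumptions: the feature frequencies are passed in precomputed as a dictionary because the frequency formula is not reproduced here, `detection_rate` is a hypothetical callback that trains a classifier and returns its detection rate, all features tied at the smallest frequency are removed together, and the loop stops rather than emptying the subset.

```python
from typing import Callable, Dict, List


def decorrelate(
    tf: Dict[str, float],
    detection_rate: Callable[[List[str]], float],
) -> List[str]:
    """Drop zero-frequency features, then the least frequent ones while the rate is unchanged."""
    current = [f for f, value in tf.items() if value != 0]   # step 2
    while len(current) > 1:
        tp_tf = detection_rate(current)                      # step 3
        lowest = min(tf[f] for f in current)
        candidate = [f for f in current if tf[f] != lowest]  # step 4
        if not candidate:
            break
        tp_tf_prime = detection_rate(candidate)
        if tp_tf_prime == tp_tf:   # step 5: exact equality, as in the text
            current = candidate
        else:
            break
    return current                                           # step 6: F_v1
```

Exact float equality follows the wording of step 5; a practical variant might compare within a small tolerance instead.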
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 5 when executing the computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811495637.7A CN109784046B (en) | 2018-12-07 | 2018-12-07 | Malicious software detection method and device and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811495637.7A CN109784046B (en) | 2018-12-07 | 2018-12-07 | Malicious software detection method and device and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109784046A (en) | 2019-05-21 |
CN109784046B (en) | 2021-02-02 |
Family
ID=66495778
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811495637.7A Active CN109784046B (en) | 2018-12-07 | 2018-12-07 | Malicious software detection method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109784046B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110464345B (en) * | 2019-08-22 | 2020-10-30 | 北京航空航天大学 | Independent head biological power supply signal interference elimination method and system |
CN110990834B (en) * | 2019-11-19 | 2022-12-27 | 重庆邮电大学 | Static detection method, system and medium for android malicious software |
CN110955895B (en) * | 2019-11-29 | 2022-03-29 | 珠海豹趣科技有限公司 | Operation interception method and device and computer readable storage medium |
CN112632539B (en) * | 2020-12-28 | 2024-04-09 | 西北工业大学 | Dynamic and static hybrid feature extraction method in Android system malicious software detection |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2128798A1 (en) * | 2008-05-27 | 2009-12-02 | Deutsche Telekom AG | Unknown malcode detection using classifiers with optimal training sets |
CN104298715A (en) * | 2014-09-16 | 2015-01-21 | 北京航空航天大学 | TF-IDF based multiple-index result merging and sequencing method |
CN105320887A (en) * | 2015-10-12 | 2016-02-10 | 湖南大学 | Static characteristic extraction and selection based detection method for Android malicious application |
CN107577942A (en) * | 2017-08-22 | 2018-01-12 | 中国民航大学 | A kind of composite character screening technique for Android malware detection |
Also Published As
Publication number | Publication date |
---|---|
CN109784046A (en) | 2019-05-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||