CN109784046B

CN109784046B - Malicious software detection method and device and electronic equipment

Info

Publication number: CN109784046B
Application number: CN201811495637.7A
Authority: CN
Inventors: 胡一博; 朱诗兵; 李长青; 帅海峰; 吕登龙; 徐华正; 张记瑞
Original assignee: Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Current assignee: Peoples Liberation Army Strategic Support Force Aerospace Engineering University
Priority date: 2018-12-07
Filing date: 2018-12-07
Publication date: 2021-02-02
Anticipated expiration: 2038-12-07
Also published as: CN109784046A

Abstract

The invention discloses a malware detection method, a malware detection apparatus, and an electronic device, and relates to the field of security protection for mobile terminals. It can effectively detect malware and addresses the redundancy, irrelevance, and noise that affect malware feature extraction in the prior art. The malware detection method comprises: feature extraction; subset generation; and detection model generation. The malware detection apparatus comprises a feature extraction module, a subset generation module, and a detection model generation module. The electronic device comprises a memory, a processor, and a computer program that is stored in the memory and executable on the processor; the processor implements the malware detection method when executing the program.

Description

Malicious software detection method and device and electronic equipment

Technical Field

The present invention relates to security protection of a mobile terminal, and in particular, to a method and an apparatus for detecting malicious software, and an electronic device.

Background

A mobile intelligent terminal is the general term for mobile terminals that have network access capability and run an operating system and application programs. The wide adoption and powerful functions of mobile intelligent terminals, including those running the Android system, have made them an irreplaceable tool in many fields of modern society. At the same time, malware targeting these terminals has become increasingly rampant. Such malware damages the user's system without being noticed, steals user data, and incurs charges, seriously threatening the user's privacy and property. More seriously, it threatens confidential national information in areas such as the economy, politics, and military affairs, and harms national security. To counter the growing malware threat to mobile intelligent terminals and to meet the future need to detect unknown malware on them, a detection method for Android malware is needed.

Existing detection methods based on machine learning take an artificial intelligence perspective: a classification algorithm learns the features of known malware, and a continuously evolving, generalizable intelligent detection model is built to detect Android software automatically. The key to such methods is feature selection: the more effectively the selected features distinguish malware from normal software, the more efficient the detection model obtained with the machine learning classification algorithm, and the better its detection of malware. However, the malware features extracted by existing methods suffer from redundancy, irrelevance, and noise: redundant features reduce the computational efficiency of the classification algorithm and the effectiveness of the detection model; irrelevant features mean that more training samples are needed to obtain a suitable detection model; and noisy features can directly lead to an incorrect detection model. These problems greatly increase the time and space cost of machine learning, to the point where the classification algorithm can no longer analyze and process the features because the cost is too high.

Disclosure of Invention

In view of this, the present invention provides a detection method, an apparatus and an electronic device, which can meet the detection requirement of a mobile terminal on unknown malware and solve the problems of redundancy, irrelevance and noise faced by feature selection.

Based on the above purpose, the present invention provides a malware detection method. The malware detection method comprises the following steps:

extracting feature information of sample set software, abstracting the feature information into a digital form, and obtaining a sample set feature set and a sample set feature matrix;

filtering invalid features in the feature set by using a feature selection algorithm to obtain an optimal feature subset;

and training the feature matrix corresponding to the optimal feature subset by adopting a machine learning classification algorithm to generate a detection model.

Optionally, the extracting feature information of the sample set software, and abstracting the feature information into a digital form to obtain a sample set feature set and a sample set feature matrix includes:

processing the sample set software installation package to obtain a global configuration file containing permission information and a decompiled file containing API information;

extracting the corresponding permission and API feature information from the global configuration file and the decompiled file;

and vectorizing the extracted permission and API feature information and abstracting it into digital form to obtain a sample set feature set and a sample set feature matrix.

Optionally, the filtering invalid features in the sample set feature set by using a feature selection algorithm to obtain an optimal feature subset includes:

step one: initializing the sample set feature set, the related parameter constants of the sample set feature matrix, and the related parameters used in the subset generation process;

step two: calculating the feature frequency of each feature in the sample set feature set according to a feature frequency calculation formula, and filtering out irrelevant features through calculation and comparison to obtain a decorrelated feature subset;

the characteristic frequency calculation formula is as follows:

wherein $TF(f^j)$ represents the feature frequency of feature $f^j$; $N_{benign}$ represents the number of normal samples in the normal software set, and the formula also uses the number of normal samples in which feature $f^j$ appears; $N_{malware}$ represents the number of malicious samples in the malicious sample set, and the formula also uses the number of malicious samples in which feature $f^j$ appears;

step three: calculating the information gain of each feature in the decorrelation feature subset according to an information gain calculation formula, and obtaining a denoising feature subset through calculation, comparison and screening;

the information gain calculation formula is as follows:

$$IG(f^j) = H(Y) - H(Y \mid f^j)$$

wherein $IG(f^j)$ represents the information gain of feature $f^j$ for the classification system, $H(Y)$ represents the entropy of the classification system, and $H(Y \mid f^j)$ represents the conditional entropy of the classification system;

step four: calculating, according to the $\chi^2$ statistic calculation formula, the CHI value ($\chi^2$ statistic) between each feature in the denoised feature subset and the corresponding feature matrix, as well as the CHI value between features, and obtaining a de-redundant feature subset through calculation, comparison, and screening;

the $\chi^2$ statistic calculation formula is as follows:

$$CHI(f^i, f^j) = \xi_{11} + \xi_{12} + \xi_{21} + \xi_{22}$$

wherein $CHI(f^i, f^j)$ represents the $\chi^2$ statistic of features $f^i$ and $f^j$; $\xi_{11}$ represents the deviation between the theoretical and actual values for samples in which $f^i$ and $f^j$ both appear; $\xi_{12}$ represents the deviation for samples in which $f^i$ appears and $f^j$ does not; $\xi_{21}$ represents the deviation for samples in which $f^i$ does not appear and $f^j$ does; and $\xi_{22}$ represents the deviation for samples in which neither $f^i$ nor $f^j$ appears;

step five: and analyzing and judging the redundancy-removing feature subset, and performing subset optimization according to a judgment result to obtain an optimal feature subset.

Optionally, the step one specifically includes:

denoting the sample set feature set as $F_v$, the sample set feature matrix as $X_{train}$, and the number of selected features as $M_v$; setting the initial information gain threshold to a specific value $\theta_{ig}$, the information gain step size to $\lambda$, the initial value of the information gain loop step count $n$ to 0, and the detection rate threshold to 0.95; and training the sample set feature matrix $X_{train}$ with a machine learning classification algorithm and recording the maximum detection rate as $TP_{max}$.

Optionally, the second step includes:

step 1: calculating the feature frequency of each feature in the sample set feature set;

step 2: filtering out the features whose feature frequency is 0, the remaining features forming an intermediate feature subset;

step 3: training the feature matrix corresponding to the intermediate feature subset with a machine learning classification algorithm to obtain the corresponding detection rate $TP_{tf}$;

step 4: filtering out of the intermediate feature subset the features with the smallest feature frequency, the remaining features forming a feature subset; training the feature matrix corresponding to this feature subset with a machine learning classification algorithm to obtain the corresponding detection rate $TP_{tf}'$;

step 5: comparing $TP_{tf}$ and $TP_{tf}'$; if $TP_{tf} = TP_{tf}'$, taking the feature subset as the new intermediate feature subset and returning to step 3; if $TP_{tf} \neq TP_{tf}'$, outputting the intermediate feature subset;

step 6: denoting the output intermediate feature subset as the feature subset $F_{v1}$ and the number of selected features as $M_{v1}$; the feature subset $F_{v1}$ is the decorrelated feature subset.

Optionally, the third step includes:

step 1: calculating the information gain of each feature in the decorrelated feature subset;

step 2: incrementing the loop step count by 1, that is, $n = n + 1$;

step 3: selecting the features that satisfy $IG > (\theta_{ig} - (n-1)\lambda)$ to form a feature subset and recording the number of selected features; selecting the features that satisfy $IG > (\theta_{ig} - n\lambda)$ to form a further feature subset and recording the number of selected features;

step 4: comparing the two recorded feature counts and, depending on the result, either returning to step 2 or outputting the feature subset;

step 5: denoting the output feature subset as the feature subset $F_{v2}$ and the number of selected features as $M_{v2}$; the feature subset $F_{v2}$ is the denoised feature subset.

Optionally, the fourth step includes:

step 1: calculating the CHI value between each feature in the denoised feature subset $F_{v2}$ and the corresponding feature matrix, and recording the largest CHI value as $\theta_{chi}$;

step 2: calculating the CHI values between features, selecting the feature pairs whose CHI value is larger than $\theta_{chi}$, selecting from them the features with the smaller IG value, and arranging these features in descending order of CHI value to form a redundant feature set; recording the number of redundant features selected;

step 3: setting the loop step count $m$ to 0;

step 4: incrementing the loop step count by 1, that is, $m = m + 1$;

step 5: removing redundant features from $F_{v2}$ in the order in which they are arranged in the redundant feature set to obtain a feature subset; recording the number of selected features; training the feature matrix corresponding to the feature subset with a machine learning classification algorithm to obtain the corresponding detection rate;

step 6: comparing $m$ with the number of redundant features and, depending on the result, either returning to step 4 or executing the next step;

step 7: comparing all the detection rates obtained and recording the maximum detection rate; the feature subset corresponding to the maximum detection rate is denoted $F_{v3}$ and the number of selected features $M_{v3}$; the feature subset $F_{v3}$ is the de-redundant feature subset.

Optionally, the step five specifically includes:

comparing the detection rate obtained in step four with the maximum detection rate $TP_{max}$ and assigning the larger value to $TP_{max}$; comparing $TP_{max}$ with the initially set detection rate threshold of 0.95; if $TP_{max} < 0.95$, returning to step three; if $TP_{max} \geq 0.95$, the feature subset corresponding to $TP_{max}$ is the optimal feature subset, and the optimal feature subset is denoted $F_v$.

Optionally, the method for generating a detection model specifically includes:

setting a detection rate threshold, training a feature matrix corresponding to the optimal feature subset by respectively utilizing a Bayesian algorithm, a support vector machine algorithm, a decision tree algorithm and a nearest neighbor classification algorithm, and selecting the optimal detection model to output according to the set detection rate threshold.

The method for selecting the optimal detection model according to the set detection rate threshold specifically comprises the following steps:

if the detection rate of the detection model obtained through training is not less than the threshold value, outputting the corresponding detection model;

if the detection rate of the detection model obtained through training is lower than the threshold value, changing the combination mode of the characteristics, retraining to obtain a new detection model until the threshold value requirement is met, and outputting the detection model meeting the threshold value requirement;

and if all possible feature combination modes are traversed and the detection rate of the obtained detection model still fails to meet the threshold requirement, outputting the detection model with the highest detection rate in the traversal process.
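The selection rule above can be sketched as a small loop. This is a minimal illustration, not the patented implementation: the function name `select_detection_model`, the shape of the trainer callables (each returns a model and its detection rate), and the idea of passing feature combinations as an explicit list are all assumptions made for the example.

```python
def select_detection_model(trainers, feature_combinations, threshold):
    """Return (algorithm name, features, model, detection rate).

    trainers: dict mapping an algorithm name (e.g. "NB", "SVM", "DT", "KNN")
              to a hypothetical callable train(features) -> (model, rate).
    feature_combinations: iterable of candidate feature combinations.
    threshold: the detection rate threshold.
    """
    best = None  # best (rate, name, features, model) seen so far
    for features in feature_combinations:
        for name, train in trainers.items():
            model, rate = train(features)
            # Output the first model whose detection rate meets the threshold.
            if rate >= threshold:
                return name, features, model, rate
            if best is None or rate > best[0]:
                best = (rate, name, features, model)
    if best is None:
        raise ValueError("no trainer or feature combination was supplied")
    # Every combination was tried without meeting the threshold:
    # fall back to the model with the highest detection rate.
    rate, name, features, model = best
    return name, features, model, rate


def make_stub_trainer(name, rates):
    """Build a stand-in trainer that looks up a fixed detection rate."""
    def train(features):
        return name + "-model", rates[features]
    return train


stub_trainers = {
    "NB": make_stub_trainer("NB", {("perm",): 0.90, ("perm", "api"): 0.93}),
    "SVM": make_stub_trainer("SVM", {("perm",): 0.92, ("perm", "api"): 0.97}),
}
```

With the stub trainers, calling `select_detection_model(stub_trainers, [("perm",), ("perm", "api")], 0.95)` moves on to the second feature combination because no model reaches 0.95 on the first one.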

The invention also provides a malicious software detection device, which comprises:

a feature extraction module, configured to extract feature information of the sample set software and abstract the feature information into digital form to obtain a sample set feature set and a sample set feature matrix;

a subset generation module, configured to filter out invalid features in the sample set feature set with a feature selection algorithm to obtain an optimal feature subset;

a detection model generation module, configured to train the feature matrix corresponding to the optimal feature subset with a machine learning classification algorithm to generate a detection model.

The invention also provides electronic equipment for detecting the malicious software, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor executes the program to realize the malicious software detection method provided by the invention.

From the above, it can be seen that the malware detection method, apparatus, and electronic device provided by the invention can detect malware effectively by extracting the permission and sensitive API features of malware and normal software, optimally selecting among the extracted features with a feature selection algorithm, and training on the selected combined permission and sensitive API features with a machine learning classification algorithm. The feature selection algorithm is designed on the basis of feature frequency, information gain, and χ² statistics: the feature frequency method filters out features in the feature set that are irrelevant to classification, the information gain method selects the features that have a large influence on classification, and the χ² statistic method removes highly redundant features from the feature set. The malware detection method provided by the invention can therefore overcome the redundancy, irrelevance, and noise present in the malware features extracted by the prior art. The feature selection algorithm combines the three methods (feature frequency, information gain, and χ² statistics) in a preferred specific order, which gives a better selection effect than combining them arbitrarily or using only one or two of them. Because the method trains the feature subset produced by the feature selection algorithm with a machine learning classification algorithm, the resulting detection model is more efficient and detects malware better.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a diagram illustrating a malware detection method according to an embodiment of the present invention;

FIG. 2 is a block flow diagram of a malware detection method in an embodiment of the invention;

FIG. 3 is a flow chart of a feature extraction method in the malware detection method in the embodiment of the present invention;

FIG. 4 is a diagram illustrating a subset generation method in the malware detection method according to an embodiment of the present invention;

FIG. 5 is a flow chart of a method for decorrelating feature frequencies in a malware detection method in an embodiment of the present invention;

FIG. 6 is a flowchart illustrating an information gain denoising method in a malware detection method according to an embodiment of the present invention;

FIG. 7 is a flow chart of a χ² statistic redundancy removal method in the malware detection method according to the embodiment of the present invention;

FIG. 8 is a schematic diagram illustrating a method for generating a detection model in a malware detection method according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

In one aspect of the invention, a malware detection method is provided.

As shown in fig. 1 and 2, in some embodiments of a malware detection method provided by the present invention, the malware detection method specifically includes:

S101: Feature extraction. The software samples are decompiled by reverse engineering, feature information is extracted and abstracted into a sample set feature set and a sample set feature matrix in an easily analyzed digital form, and both are stored in a database;

S102: Subset generation. Invalid features in the sample set feature set are filtered out with a feature selection algorithm to obtain an optimal feature subset;

S103: Detection model generation. The feature matrix corresponding to the optimal feature subset is trained with a machine learning classification algorithm to generate a dual-feature detection model.

As shown in fig. 3, in another embodiment of a malware detection method provided by the present invention, the feature extraction method specifically includes:

S301: processing the software APK package, decoding the AndroidManifest.xml file containing permission information into a global configuration file in plaintext format, and decompiling a classes.

S302: extracting the corresponding permission and API feature information from the global configuration file and the smali file;

S303: vectorizing the extracted feature semantic information and abstracting it into digital form to obtain a sample set feature set and a sample set feature matrix. In the sample set feature matrix, a feature that appears in a sample is represented by 1 and a feature that does not appear in the sample is represented by 0, so that a binary sample set feature matrix describes the features of the sample set, wherein the rows represent sample vectors and the columns represent feature vectors.

As shown in fig. 4, in another embodiment of a malware detection method provided by the present invention, the subset generation method specifically includes:

S401, step one: constant initialization. The sample set feature set, the related parameter constants of the sample set feature matrix, and the related parameters used in the subset generation process are initialized.

S402, step two: feature frequency decorrelation. The feature frequency of each feature in the sample set feature set is calculated according to the feature frequency calculation formula, and irrelevant features are filtered out through calculation and comparison to obtain a decorrelated feature subset;

the feature frequency calculation formula is as follows:

wherein $TF(f^j)$ represents the feature frequency of feature $f^j$, $N_{benign}$ represents the number of normal samples in the normal software set, and the formula also uses the number of samples in which feature $f^j$ appears.
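The formula itself is not reproduced in the text, so the following sketch rests on an assumption: it takes the feature frequency to be the absolute difference between the share of normal samples and the share of malicious samples in which the feature appears. Under that assumed definition a value of 0 means the feature is equally common in both classes. The function name and the label encoding (0 for normal, 1 for malicious) are also illustrative choices.

```python
def feature_frequency(column, labels):
    """Assumed feature frequency of one feature.

    column: 0/1 values of the feature, one per sample.
    labels: class per sample, 0 = normal software, 1 = malware.
    ASSUMPTION: TF = |hits_benign / N_benign - hits_malware / N_malware|;
    the source text does not show the actual formula.
    """
    n_benign = labels.count(0)
    n_malware = labels.count(1)
    benign_hits = sum(1 for x, y in zip(column, labels) if x == 1 and y == 0)
    malware_hits = sum(1 for x, y in zip(column, labels) if x == 1 and y == 1)
    return abs(benign_hits / n_benign - malware_hits / n_malware)


toy_labels = [0, 0, 1, 1]
tf_separating = feature_frequency([1, 1, 0, 0], toy_labels)  # appears only in normal samples
tf_uninformative = feature_frequency([1, 0, 1, 0], toy_labels)  # equally common in both classes
```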

S403, step three: information gain denoising. The information gain of each feature in the decorrelated feature subset is calculated according to the information gain calculation formula, and a denoised feature subset is obtained through calculation, comparison, and screening;

the information gain calculation formula is as follows:

$$IG(f^j) = H(Y) - H(Y \mid f^j)$$

wherein $IG(f^j)$ represents the information gain of feature $f^j$ for the classification system, $H(Y)$ represents the entropy of the classification system, and $H(Y \mid f^j)$ represents the conditional entropy of the classification system.

The information gain calculation formula is explained as follows:

Let the probability of a normal software sample be $P(c_0)$ and the probability of a malware sample be $P(c_1)$. The entropy of the classification system is then defined as:

$$H(Y) = -\sum_{i=0}^{1} P(c_i) \log P(c_i)$$

Given that feature $f^j$ appears, with conditional class probabilities $P(c_i \mid f^j = 1)$, the conditional entropy of the classification system is defined as:

$$H(Y \mid f^j = 1) = -\sum_{i=0}^{1} P(c_i \mid f^j = 1) \log P(c_i \mid f^j = 1)$$

Likewise, when feature $f^j$ does not appear, the entropy of the classification system is defined as:

$$H(Y \mid f^j = 0) = -\sum_{i=0}^{1} P(c_i \mid f^j = 0) \log P(c_i \mid f^j = 0)$$

wherein the value of the probability $P(c_i)$ is the proportion of class $c_i$ samples in the total number of training samples; the value of the probability $P(f^j = 1)$ is the proportion of samples in which feature $f^j$ appears in the total number of samples; and the value of the probability $P(f^j = 0)$ is the proportion of samples in which feature $f^j$ does not appear in the total number of samples.

Thus, the information gain $IG(f^j)$ of feature $f^j$ for the classification system is calculated as follows:

$$IG(f^j) = H(Y) - P(f^j = 1)\, H(Y \mid f^j = 1) - P(f^j = 0)\, H(Y \mid f^j = 0)$$
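The information gain of a single binary feature can be computed directly from a feature column and the class labels. The sketch below follows the definition $IG(f^j) = H(Y) - H(Y \mid f^j)$, weighting the entropy of each branch (feature present, feature absent) by the share of samples in that branch. The base-2 logarithm and the function names are assumptions; the text does not state the logarithm base.

```python
import math


def entropy(labels):
    """Entropy of a list of class labels, using a base-2 logarithm (assumed)."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h


def information_gain(column, labels):
    """Information gain of one binary feature column with respect to the labels."""
    n = len(labels)
    present = [y for x, y in zip(column, labels) if x == 1]
    absent = [y for x, y in zip(column, labels) if x == 0]
    conditional = (len(present) / n) * entropy(present) + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


toy_labels_ig = [1, 1, 0, 0]
ig_perfect = information_gain([1, 1, 0, 0], toy_labels_ig)  # feature identical to the label
ig_useless = information_gain([1, 0, 1, 0], toy_labels_ig)  # feature independent of the label
```

A feature that matches the label exactly recovers the full class entropy, while a feature that is independent of the label has zero gain.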

S404, step four: $\chi^2$ statistic redundancy removal. According to the $\chi^2$ statistic calculation formula, the CHI value ($\chi^2$ statistic) between each feature in the denoised feature subset and the corresponding feature matrix is calculated, as well as the CHI value between features. A de-redundant feature subset is obtained through calculation, comparison, and screening;

the $\chi^2$ statistic of two features $f^i$ and $f^j$ is calculated as follows:

$$CHI(f^i, f^j) = \xi_{11} + \xi_{12} + \xi_{21} + \xi_{22}$$

wherein $CHI(f^i, f^j)$ represents the $\chi^2$ statistic of features $f^i$ and $f^j$; $\xi_{11}$ represents the deviation between the theoretical and actual values for samples in which $f^i$ and $f^j$ both appear; $\xi_{12}$ represents the deviation for samples in which $f^i$ appears and $f^j$ does not; $\xi_{21}$ represents the deviation for samples in which $f^i$ does not appear and $f^j$ does; and $\xi_{22}$ represents the deviation for samples in which neither $f^i$ nor $f^j$ appears.

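The $\chi^2$ statistic calculation formula is explained as follows:

The $\chi^2$ statistic measures the degree of correlation between features and categories by the deviation between actual and theoretical values. For two features $f^i$ and $f^j$, let $N_{11}$ be the number of samples in which both features appear, $N_{22}$ the number of samples in which neither appears, $N_{12}$ the number of samples in which $f^i$ appears and $f^j$ does not, and $N_{21}$ the number of samples in which $f^i$ does not appear but $f^j$ does. The relationship between them is shown in Table 1:

TABLE 1 Feature distribution table

| | $f^j$ appears | $f^j$ does not appear | Total |
| $f^i$ appears | $N_{11}$ | $N_{12}$ | $N_{11} + N_{12}$ |
| $f^i$ does not appear | $N_{21}$ | $N_{22}$ | $N_{21} + N_{22}$ |
| Total | $N_{11} + N_{21}$ | $N_{12} + N_{22}$ | $N$ |

where $N$ is the total number of samples, whose value is the sum of the four cases, i.e.

$$N = N_{11} + N_{12} + N_{21} + N_{22}$$

Thus, the frequency with which feature $f^i$ appears is:

$$P(f^i) = \frac{N_{11} + N_{12}}{N}$$

The number of samples in which feature $f^j$ appears is $N_{11} + N_{21}$. Theoretically, among the samples in which feature $f^j$ appears, the number of samples in which feature $f^i$ also appears is:

$$E_{11} = (N_{11} + N_{21}) \cdot \frac{N_{11} + N_{12}}{N}$$

The deviation $\xi_{11}$ between the theoretical and actual values for features $f^i$ and $f^j$ appearing together is then:

$$\xi_{11} = \frac{(N_{11} - E_{11})^2}{E_{11}}$$

In the same way, one obtains the theoretical number of samples $E_{12}$ in which $f^i$ appears and $f^j$ does not, the theoretical number $E_{21}$ in which $f^i$ does not appear but $f^j$ does, the theoretical number $E_{22}$ in which neither appears, and the deviations $\xi_{12}$, $\xi_{21}$, $\xi_{22}$ between their theoretical and actual values:

$$E_{12} = (N_{12} + N_{22}) \cdot \frac{N_{11} + N_{12}}{N}, \qquad \xi_{12} = \frac{(N_{12} - E_{12})^2}{E_{12}}$$

$$E_{21} = (N_{11} + N_{21}) \cdot \frac{N_{21} + N_{22}}{N}, \qquad \xi_{21} = \frac{(N_{21} - E_{21})^2}{E_{21}}$$

$$E_{22} = (N_{12} + N_{22}) \cdot \frac{N_{21} + N_{22}}{N}, \qquad \xi_{22} = \frac{(N_{22} - E_{22})^2}{E_{22}}$$

Thus, the $\chi^2$ statistic of two features $f^i$ and $f^j$ is the sum of the deviations $\xi_{11}$, $\xi_{12}$, $\xi_{21}$, $\xi_{22}$, i.e.

$$CHI(f^i, f^j) = \xi_{11} + \xi_{12} + \xi_{21} + \xi_{22}$$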
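The sum of the four deviations can be computed from two binary feature columns with the standard 2×2 contingency-table form of the χ² statistic, in which each deviation is the squared difference between the observed and theoretical counts divided by the theoretical count. The sketch below assumes that standard form; the function name and the guard that skips empty cells are illustrative additions.

```python
def chi_square(col_i, col_j):
    """Chi-square statistic between two binary feature columns.

    Sums (observed - expected)^2 / expected over the four cells of the
    2x2 table: both appear, only f_i, only f_j, neither.
    """
    n = len(col_i)
    observed = {(a, b): 0 for a in (1, 0) for b in (1, 0)}
    for a, b in zip(col_i, col_j):
        observed[(a, b)] += 1
    chi = 0.0
    for (a, b), count in observed.items():
        row_total = observed[(a, 1)] + observed[(a, 0)]  # samples with f_i == a
        col_total = observed[(1, b)] + observed[(0, b)]  # samples with f_j == b
        expected = row_total * col_total / n             # theoretical count
        if expected > 0:  # skip cells with no theoretical mass (guard, not from the source)
            chi += (count - expected) ** 2 / expected
    return chi


chi_identical = chi_square([1, 1, 0, 0], [1, 1, 0, 0])    # fully redundant pair
chi_independent = chi_square([1, 1, 0, 0], [1, 0, 1, 0])  # unrelated pair
```

Two identical columns give the largest possible value for this table, which is why a high CHI value between features is treated as a sign of redundancy.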

S405, step five: generating the optimal feature subset. The de-redundant feature subset is analyzed and judged, and further operations are performed according to the judgment result to obtain the final optimal feature subset.

Step one is specifically as follows:

The sample set feature matrix is denoted $X_{train}$, the selected feature set $F_v$, and the number of selected features $M_v$. The initial information gain threshold is set to a specific value $\theta_{ig}$, the information gain step size to $\lambda$, the information gain loop step count $n$ to 0, and the detection rate threshold to 0.95. The original feature matrix $X_{train}$ is trained with a machine learning classification algorithm, and the maximum detection rate is recorded as $TP_{max}$.

As shown in fig. 5, the second step specifically includes:

S501: calculating the feature frequency of each feature in the sample set feature set;

S502: filtering out the features whose feature frequency is 0, the remaining features forming an intermediate feature subset;

S503: training the feature matrix corresponding to the intermediate feature subset with a machine learning classification algorithm to obtain the corresponding detection rate $TP_{tf}$;

S504: filtering out of the intermediate feature subset the features with the smallest feature frequency, the remaining features forming a feature subset; training the feature matrix corresponding to this feature subset with a machine learning classification algorithm to obtain the corresponding detection rate $TP_{tf}'$;

S505: comparing $TP_{tf}$ and $TP_{tf}'$; if $TP_{tf} = TP_{tf}'$, taking the feature subset as the new intermediate feature subset and returning to step S503; if $TP_{tf} \neq TP_{tf}'$, outputting the intermediate feature subset;

S506: denoting the output feature subset as $F_{v1}$ and the number of selected features as $M_{v1}$; the feature subset $F_{v1}$ is the decorrelated feature subset.
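The loop of steps S501 to S506 can be sketched as follows. The classifier is abstracted into a hypothetical `evaluate` callable that takes a list of features and returns a detection rate; that callable, the function name, and the two early-exit guards are assumptions added so the example is self-contained.

```python
def decorrelate(tf, evaluate):
    """Feature frequency decorrelation loop.

    tf: dict mapping each feature to its feature frequency (S501).
    evaluate: hypothetical callable(list of features) -> detection rate.
    Returns the decorrelated feature subset.
    """
    current = [f for f, value in tf.items() if value > 0]   # S502: drop frequency 0
    if not current:
        return current                                      # guard, not from the source
    while True:
        rate = evaluate(current)                            # S503
        lowest = min(tf[f] for f in current)
        candidate = [f for f in current if tf[f] > lowest]  # S504: drop smallest frequency
        if not candidate:
            return current                                  # guard: nothing left to remove
        if evaluate(candidate) != rate:                     # S505: detection rate changed
            return current
        current = candidate                                 # rate unchanged: keep pruning


toy_tf = {"a": 0.0, "b": 0.1, "c": 0.5, "d": 0.9}


def toy_evaluate(features):
    """Stand-in for training a classifier: only c and d together detect well."""
    return 0.9 if "c" in features and "d" in features else 0.5


toy_decorrelated = decorrelate(toy_tf, toy_evaluate)
```

The exact equality test mirrors the comparison in S505; with real, noisy detection rates a tolerance may be more practical, which is a design choice outside the text.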

As shown in fig. 6, the third step specifically includes:

S601: calculating the information gain of each feature in the decorrelated feature subset;

S602: incrementing the loop step count by 1, that is, $n = n + 1$;

S603: selecting the features that satisfy $IG > (\theta_{ig} - (n-1)\lambda)$ to form a feature subset and recording the number of selected features; selecting the features that satisfy $IG > (\theta_{ig} - n\lambda)$ to form a further feature subset and recording the number of selected features;

S604: comparing the two recorded feature counts and, depending on the result, either returning to step S602 or outputting the feature subset;

S605: denoting the output feature subset as $F_{v2}$ and the number of selected features as $M_{v2}$; the feature subset $F_{v2}$ is the denoised feature subset.
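The comparison condition in S604 is not recoverable from the text, so the sketch below makes an explicit assumption: the threshold keeps being lowered by one step while the two feature counts are equal, and the subset selected with the lower threshold is output as soon as the counts differ. The function name, the `max_steps` guard, and the returned step count are likewise illustrative.

```python
def denoise(ig, theta_ig, lam, n=0, max_steps=1000):
    """Information gain denoising loop.

    ig: dict mapping each feature to its information gain (S601).
    theta_ig, lam: initial threshold and step size.
    n: current loop step count, so the loop can be resumed later.
    Returns (selected features, updated n).
    ASSUMPTION: the loop stops when lowering the threshold changes the
    number of selected features; the source does not show the condition.
    """
    current = [f for f, gain in ig.items() if gain > theta_ig - n * lam]
    while n < max_steps:                                     # guard against endless loops
        n += 1                                               # S602
        previous = [f for f, gain in ig.items() if gain > theta_ig - (n - 1) * lam]
        current = [f for f, gain in ig.items() if gain > theta_ig - n * lam]  # S603
        if len(current) != len(previous):                    # S604 (assumed condition)
            break
    return current, n


toy_ig = {"a": 0.5, "b": 0.3, "c": 0.05}
first_subset, first_n = denoise(toy_ig, theta_ig=0.55, lam=0.1)
second_subset, second_n = denoise(toy_ig, theta_ig=0.55, lam=0.1, n=first_n)
```

Passing the returned step count back in lets a later call continue lowering the threshold from where the previous call stopped.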

As shown in fig. 7, the fourth step specifically includes:

S701: calculating the CHI value ($\chi^2$ statistic) between each feature in the feature subset $F_{v2}$ and the corresponding feature matrix, and recording the largest CHI value as $\theta_{chi}$;

S702: calculating the CHI values between features, selecting the feature pairs whose CHI value is larger than $\theta_{chi}$, selecting from them the features with the smaller IG value, and arranging these features in descending order of CHI value to form a redundant feature set; recording the number of redundant features selected;

S703: setting the loop step count $m$ to 0;

S704: incrementing the loop step count by 1, that is, $m = m + 1$;

S705: removing redundant features from $F_{v2}$ in the order in which they are arranged in the redundant feature set to obtain a feature subset; recording the number of selected features; training the feature matrix corresponding to the feature subset with a machine learning classification algorithm to obtain the corresponding detection rate;

S706: comparing $m$ with the number of redundant features and, depending on the result, either returning to step S704 or executing the next step;

S707: comparing all the detection rates obtained and recording the maximum detection rate; the feature subset corresponding to the maximum detection rate is denoted $F_{v3}$ and the number of selected features $M_{v3}$; the feature subset $F_{v3}$ is the de-redundant feature subset.
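A rough sketch of the redundancy removal loop is given below. Several details are assumptions rather than statements from the text: that at loop step m the first m features of the redundant feature set are removed, that the loop runs once for every redundant feature, and that the classifier is a hypothetical `evaluate` callable returning a detection rate. The function names are illustrative.

```python
def redundant_features(pair_chi, ig, theta_chi):
    """Build the redundant feature set (S702).

    pair_chi: dict mapping a feature pair (fi, fj) to its CHI value.
    ig: dict mapping each feature to its information gain.
    Returns the lower-IG feature of every pair above theta_chi,
    sorted by descending CHI value.
    """
    picked = {}
    for (fi, fj), chi in pair_chi.items():
        if chi > theta_chi:
            weaker = fi if ig[fi] < ig[fj] else fj
            picked[weaker] = max(chi, picked.get(weaker, 0.0))
    return sorted(picked, key=picked.get, reverse=True)


def remove_redundancy(features, redundant, evaluate):
    """Try removing redundant features and keep the best subset (S703 to S707).

    ASSUMPTION: at step m the first m redundant features are removed.
    Returns (best subset, its detection rate).
    """
    if not redundant:
        return list(features), evaluate(list(features))  # guard, not from the source
    best_subset, best_rate = None, -1.0
    for m in range(1, len(redundant) + 1):
        dropped = set(redundant[:m])
        subset = [f for f in features if f not in dropped]
        rate = evaluate(subset)
        if rate > best_rate:
            best_subset, best_rate = subset, rate
    return best_subset, best_rate


toy_pairs = {("a", "b"): 9.0, ("a", "c"): 2.0, ("c", "d"): 7.0}
toy_gain = {"a": 0.5, "b": 0.4, "c": 0.3, "d": 0.2}
toy_redundant = redundant_features(toy_pairs, toy_gain, theta_chi=5.0)
toy_best = remove_redundancy(
    ["a", "b", "c", "d"],
    toy_redundant,
    lambda subset: 0.9 if subset == ["a", "c", "d"] else 0.8,
)
```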

The method for generating the optimal feature subset in step five is specifically as follows:

The detection rate obtained in step four is compared with the maximum detection rate $TP_{max}$, and the larger value is assigned to $TP_{max}$. $TP_{max}$ is then compared with the initially set detection rate threshold of 0.95: if $TP_{max} < 0.95$, the method returns to step three; if $TP_{max} \geq 0.95$, the feature subset corresponding to $TP_{max}$ is the optimal feature subset, denoted $F_v$.

As shown in fig. 8, in another embodiment of the malware detection method provided by the present invention, the method for generating the detection model specifically includes:

First, a detection rate threshold is set. The permission and sensitive API features are then trained separately with a Bayesian algorithm (NB), a support vector machine algorithm (SVM), a decision tree algorithm (DT), and a nearest neighbor classification algorithm (KNN), and the optimal detection model is selected for output according to the set detection rate threshold.

As shown in fig. 8, in another embodiment of the malware detection method provided by the present invention, the method for selecting the optimal detection model output according to the set detection threshold specifically includes:

Table 2 shows the results of testing the detection performance in an embodiment of the malware detection method provided by the present invention.

TABLE 2 test performance results

The specific implementation method comprises the following steps:

5000 benign applications obtained from the Anzhi market and 5000 malware samples from VirusShare are used as the sample set, and 10-fold cross-validation is used for testing. In 10-fold cross-validation, the sample data is divided into 10 mutually exclusive subsets of similar size; each time, the union of 9 subsets is used as the training set and the remaining subset as the test set, so that training and testing are carried out 10 times, and the average of the 10 test results is taken as the final result.
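The 10-fold cross-validation procedure described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, the fixed shuffle seed and the `evaluate` callable (which would train a classifier on the training indices and return a score on the test indices) are assumptions made for the example and are not part of the original text.

```python
import random
from typing import Callable, List


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Split sample indices into k mutually exclusive folds of similar size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Taking every k-th index gives fold sizes that differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(
    n_samples: int,
    evaluate: Callable[[List[int], List[int]], float],
    k: int = 10,
    seed: int = 0,
) -> float:
    """Run k rounds of training and testing and return the average score."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # The union of the other k - 1 folds forms the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k
```

With the 10000-sample set described above, each of the 10 folds contains 1000 samples, so every round trains on 9000 samples and tests on the remaining 1000.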

The detection performance obtained without a feature selection algorithm is compared with that obtained using the feature selection algorithm based on feature frequency, information gain and the χ² statistic.

The performance indexes are defined as follows:

(1) TPR (detection rate) is the ratio of the positive examples correctly classified by the classifier to all actual positive examples; the larger the TPR, the better the classifier performs on positive examples. The calculation formula is as follows:

TPR = TP / (TP + FN)

(2) FPR (false alarm rate) is the ratio of the negative examples wrongly classified as positive by the classifier to all actual negative examples; the larger the FPR, the worse the classifier performs on negative examples. The calculation formula is as follows:

FPR = FP / (FP + TN)

(3) Acc (accuracy) is the ratio of all samples correctly classified by the classifier to the total number of samples, and represents how accurately the classifier classifies; the larger the Acc, the better the overall classification capability of the classifier. The calculation formula is as follows:

Acc = (TP + TN) / (TP + FP + TN + FN)

In the formulas, TP (true positive) is the number of samples that are actually positive and are detected as positive, i.e. correctly detected positive examples; FP (false positive) is the number of samples that are actually negative but are detected as positive, i.e. wrongly detected negative examples; FN (false negative) is the number of samples that are actually positive but are detected as negative, i.e. wrongly classified positive examples; TN (true negative) is the number of samples that are actually negative and are detected as negative, i.e. correctly detected negative examples.
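The three indexes follow directly from the four counts TP, FP, FN and TN. The sketch below computes them using the standard definitions of detection rate, false alarm rate and accuracy; the label convention (1 for malicious, i.e. positive, and 0 for benign, i.e. negative), the function names, and the choice to return 0.0 for an empty denominator are assumptions made for illustration.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count TP, FP, FN and TN, assuming 1 = malicious and 0 = benign."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["TP"] += 1
        elif actual == 0 and predicted == 1:
            counts["FP"] += 1
        elif actual == 1 and predicted == 0:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


def _ratio(numerator: int, denominator: int) -> float:
    # Assumed convention: report 0.0 when the denominator is empty.
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Return TPR, FPR and Acc computed from the four confusion counts."""
    return {
        "TPR": _ratio(tp, tp + fn),
        "FPR": _ratio(fp, fp + tn),
        "Acc": _ratio(tp + tn, tp + fp + fn + tn),
    }
```

For example, 3 true positives, 1 false positive, 1 false negative and 3 true negatives give a TPR of 0.75, an FPR of 0.25 and an Acc of 0.75.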

According to the test results, after the feature selection algorithm is used to select the software features, the detection models obtained by training on those features with the four different machine learning classification algorithms achieve a higher detection rate and accuracy, and a lower false alarm rate, than the detection models obtained by the traditional malware detection method that does not use a feature selection algorithm.

The malicious software detection method provided by the invention has higher detection efficiency and better detection effect.

In another aspect of the invention, a malware detection apparatus is provided.

In some embodiments of a malware detection apparatus provided by the present invention, the apparatus comprises:

In another aspect of the invention, a malware detection electronic device is provided.

In some embodiments of the present invention, an electronic device for malware detection includes:

a memory, a processor, and a computer program stored on the memory and executable on the processor.

When the processor executes the program, the malicious software detection method provided by the invention is realized.

The apparatus and the electronic device of the foregoing embodiments are used to implement the corresponding method in the foregoing embodiments, and have the beneficial effects of the corresponding method embodiments, which are not described herein again.

Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, also features in the above embodiments or in different embodiments may be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.

In addition, well known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the invention. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.

While the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.

The embodiments of the invention are intended to embrace all such alternatives, modifications and variances that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements and the like that may be made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. A malware detection method, comprising:

training a feature matrix corresponding to the optimal feature subset by adopting a machine learning classification algorithm to generate a detection model;

wherein the filtering the invalid features in the sample set feature set by using the feature selection algorithm to obtain the optimal feature subset comprises:

step one: initializing the sample set feature set, the relevant parameter constants in the sample set feature matrix and the relevant parameters used in the subset generation process, including:

denoting the sample set feature set as F_v, the sample set feature matrix as X_train, and the number of selected features as M_v; setting the initial information gain threshold to a specific value θ_ig, setting the information gain step length to λ, setting the initial value of the information gain loop step count n to 0, and setting the detection rate threshold to 0.95; training on the sample set feature matrix X_train using a machine learning classification algorithm, and recording the maximum detection rate as TP_max;

the feature frequency calculation formula is as follows:

is the number of samples in which the feature f^j is present;

the calculating of the feature frequency of each feature in the sample set feature set according to the feature frequency calculation formula, and filtering out irrelevant features through calculation and comparison to obtain a decorrelated feature subset, comprises:

Step 3: the intermediate feature subsets are classified by machine learning

Step 4: filtering out the intermediate feature subset

The feature subsets are classified by machine learning

Intermediate feature subset as new

Step 6: the intermediate feature subset is denoted as the feature subset F_v1, the number of selected features is M_v1, and said feature subset F_v1 is the decorrelated feature subset;

the information gain calculation formula is as follows:

IG(f^j)＝H(Y)-H(Y|f^j)

the χ² statistic calculation formula is as follows:

CHI(fⁱ，f^j)＝ξ₁₁+ξ₁₂+ξ₂₁+ξ₂₂

2. The method of claim 1, wherein the extracting feature information of the sample set software, and abstracting the feature information into a digital form to obtain a sample set feature set and a sample set feature matrix comprises:

3. The method of claim 1, wherein step three comprises:

Step 3: selecting the features satisfying IG > (θ_ig - (n-1)λ) to constitute a feature subset

The number of selected features is recorded as

selecting the features satisfying IG > (θ_ig - nλ) to constitute a feature subset

The number of selected features is recorded as

Step 4: comparing

And

a value of, if

Returning to the step 2; if it is not

Subset of output features

Step 5: the feature subset to be output

4. The method of claim 3, wherein the fourth step comprises:

The number of redundant features selected is

Step 3: setting the loop step count m to 0;

Step 5: from the redundant feature set

The number of selected features is

Feature subsets by machine learning classification algorithms

Step 6: comparing m with

A value of, if

Returning to the step 4; otherwise, executing the next step;

Step 7: comparing all detection rates

The maximum detection rate was recorded as

Maximum detection rate

5. The method according to claim 4, wherein the step five is specifically:

comparing the detection rate obtained in the fourth step with the maximum detection rate TP_max, and assigning the larger of the two values to TP_max; comparing TP_max with the initially set detection rate threshold of 0.95: if TP_max is less than 0.95, returning to the third step; if TP_max is greater than or equal to 0.95, the feature subset corresponding to TP_max is the optimal feature subset, which is denoted as F_v.

6. A malware detection apparatus, comprising:

the feature extraction module is used for extracting the feature information of the sample set software and abstracting the feature information into a digital form to obtain a sample set feature set and a sample set feature matrix;

the subset generation module is used for filtering invalid features in the feature set by using a feature selection algorithm to obtain an optimal feature subset;

the detection model generation module is used for training the feature matrix corresponding to the optimal feature subset by adopting a machine learning classification algorithm to generate a detection model;

wherein the subset generation module filtering invalid features in the sample set feature set by using a feature selection algorithm to obtain an optimal feature subset comprises:

the feature frequency calculation formula is as follows:

is the number of samples in which the feature f^j is present;

the subset generation module calculating the feature frequency of each feature in the sample set feature set according to the feature frequency calculation formula, and filtering out irrelevant features through calculation and comparison to obtain a decorrelated feature subset, comprises:

Step 3: the intermediate feature subsets are classified by machine learning

Step 4: filtering out the intermediate feature subset

The feature subsets are classified by machine learning

Intermediate feature subset as new

Step 6: subset the intermediate features

the information gain calculation formula is as follows:

IG(f^j)＝H(Y)-H(Y|f^j)

the χ² statistic calculation formula is as follows:

CHI(fⁱ，f^j)＝ξ₁₁+ξ₁₂+ξ₂₁+ξ₂₂

7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 5 when executing the computer program.