CN111753299A

CN111753299A - Unbalanced malicious software detection method based on packet integration

Info

Publication number: CN111753299A
Application number: CN202010571828.8A
Authority: CN
Inventors: 严海升; 李强
Original assignee: Chongqing University of Arts and Sciences
Current assignee: Chongqing University of Arts and Sciences
Priority date: 2020-06-22
Filing date: 2020-06-22
Publication date: 2020-10-09

Abstract

The invention belongs to the technical field of information security, and particularly relates to an unbalanced malicious software detection method based on packet integration, which comprises the following steps: S1, feature extraction: extracting permission information and API call information from experimental samples to form a feature vector set; the experimental samples comprise normal samples and malicious samples, and the number of normal samples is larger than the number of malicious samples; S2, feature optimization: screening the feature vector set with an information gain algorithm to remove redundant features and obtain an unbalanced data set; and S3, detecting the unbalanced data set with a grouping integration detection algorithm so as to classify the normal samples and the malicious samples. The method addresses the problem that the accuracy and stability of malicious software detection are difficult to guarantee when the data set is unbalanced.

Description

Unbalanced malicious software detection method based on packet integration

Technical Field

The invention belongs to the technical field of information security, and particularly relates to an unbalanced malicious software detection method based on packet integration.

Background

The Android platform is popular with mobile phone manufacturers because it is open source; according to the latest IDC statistics, Android phones hold 87% of the market. The same openness makes the platform vulnerable to malicious software, and 97% of the mobile malware discovered is related to the Android platform. Attackers use malicious software against the Android platform to steal users' private information, make fraudulent fee deductions and so on. The mobile security situation is therefore severe, and malware detection has become a research focus in the field of information security.

Machine learning has been very successful in information security applications such as spam filtering. Researchers have applied machine learning algorithms to Android malware detection, proposed a number of detection algorithms, and verified that machine learning is effective for the malware detection problem. One prior work proposes a lightweight detection scheme based on sensitive permissions: it analyzes how permissions differ between sample types, removes redundant permissions, and finally uses a nearest neighbor classification algorithm to identify malicious software. Zhang et al. propose an Android malware detection strategy based on naive Bayes, which uses as feature attributes whether permissions are abused and whether sensitive permissions are requested in combination. Yang et al. extract the permission and component intent information of Android software as features and detect malicious software with a random forest algorithm optimized by weighted voting. Although many malware detection methods have been proposed, most of them assume that the malicious and normal software in the training data do not differ greatly in number. In practice, normal samples can be collected in batches from third-party markets with a crawler, whereas malicious samples are costly and difficult to collect. The number of normal samples is therefore far greater than the number of malicious samples, the training data are unbalanced, and the accuracy and stability of the detection method are difficult to guarantee.

Disclosure of Invention

In order to solve the problem of low detection and classification precision of malicious software caused by data imbalance, the invention provides an unbalanced malicious software detection method based on packet integration, and the effectiveness of the method in solving the problem of data imbalance is verified on a real data set.

In order to achieve the purpose, the invention adopts the following technical scheme:

An unbalanced malware detection method based on packet integration comprises the following steps:

S1, feature extraction: extracting permission information and API call information from experimental samples to form a feature vector set; the experimental samples comprise normal samples and malicious samples, and the number of normal samples is larger than the number of malicious samples;

S2, feature optimization: screening the feature vector set with an information gain algorithm to remove redundant features and obtain an unbalanced data set;

S3, detecting the unbalanced data set with a grouping integration detection algorithm so as to classify the normal samples and the malicious samples.

Preferably, the step S3 specifically includes the following steps:

S31, randomly extracting three data sets from the unbalanced data set to serve as a training data set, a verification data set and a test data set, respectively; the number of normal samples and the number of malicious samples in the training data set are recorded as b and m, respectively;

S32, randomly extracting, without replacement, m samples from the normal samples of the training data set and combining them with the m malicious samples to form a new data set D_i; extracting k times to form k balanced data sets, wherein k = b/m;

S33, training a decision tree on each data set D_i to obtain k decision tree classifiers, testing the classification performance of each decision tree classifier t on the verification data set in turn, and calculating its recall, recorded as r_t; for decision tree classifier t, if a malicious sample is misclassified, adding the misclassified sample to the training of the next decision tree classifier, thereby forming k base classifiers;

S34, combining the k base classifiers into an integrated decision tree classifier C by weighted voting;

S35, for each test sample x in the test data set, inputting x into the k base classifiers of the integrated decision tree classifier C and calculating the weighted voting result of the k base classifiers, wherein the calculation formula is as follows:

$$V_c(x) = \sum_{t=1}^{k} r_t \, T_{c,x}(x)$$

wherein $V_c(x)$ is the weighted vote total for class c, $r_t$ is the recall of decision tree classifier t, and $T_{c,x}(x)$, evaluated on the output of base classifier t, is defined as follows:

$$T_{c,x}(x) = \begin{cases} 1, & \text{if the base classifier judges } x \text{ to be of class } c \\ 0, & \text{otherwise} \end{cases}$$

when counting the votes for the malicious class, class c is the malicious sample and class non-c is the normal sample;

when counting the votes for the normal class, class c is the normal sample and class non-c is the malicious sample;

the total votes for the malicious class and for the normal class are calculated, and the class with the most votes is selected as the final class of sample x.

Preferably, the step S1 specifically includes the following steps:

S11, writing a Python program to read the permission and API call information in the experimental samples to form a feature set;

S12, de-duplicating the feature set to form a new feature set FS;

S13, for every sample, judging whether the sample contains each element of the new feature set FS; if the sample contains the corresponding feature in FS, the corresponding element of the feature vector is 1; otherwise, the corresponding element is 0; all samples are traversed to form a feature vector set FVS.

Preferably, the step S1 further includes:

adding a flag bit at the end of each feature vector, wherein 0 represents a normal sample and 1 represents a malicious sample.

Preferably, the proportion of the training data set in the unbalanced data set is greater than 50%.

Preferably, the proportion of the training data set in the unbalanced data set is 60%, the proportion of the verification data set in the unbalanced data set is 20%, and the proportion of the test data set in the unbalanced data set is 20%.

Preferably, in step S2, the information gain algorithm calculates a difference between the entropy value of the feature and the conditional entropy thereof to obtain an IG value of the feature, and the larger the IG value is, the more important the feature is.

Preferably, in step S2, recall and G-mean are used as the metrics for screening, based on the following counts:

if a sample is predicted to be malicious and is actually malicious, the prediction is correct, and the number of such samples is recorded as TP;

if a sample is predicted to be normal but is actually malicious, the prediction is wrong, and the number of such samples is recorded as FN;

if a sample is predicted to be malicious but is actually normal, the prediction is wrong, and the number of such samples is recorded as FP;

if a sample is predicted to be normal and is actually normal, the prediction is correct, and the number of such samples is recorded as TN;

Compared with the prior art, the invention has the following beneficial effects:

The unbalanced malicious software detection method based on grouping integration uses a grouping integration detection algorithm: the normal software samples are divided into several groups by random sampling, the normal software samples in each group are combined with all the malicious software samples to train a classification model, and finally all the classification models are fused by an integration method. This overcomes the difficulty of guaranteeing the accuracy and stability of malicious software detection when the data set is unbalanced.

Drawings

FIG. 1 is a block flow diagram of an unbalanced malware detection method based on packet integration according to an embodiment of the present invention;

FIG. 2 is a flow chart of a packet integration detection algorithm of an embodiment of the present invention;

FIG. 3 is a usage ranking graph of part of the permission information in an experimental sample according to an embodiment of the invention;

FIG. 4 is a usage ranking graph of part of the API call information in an experimental sample according to an embodiment of the present invention;

FIG. 5 shows simulation results of the detection method of an embodiment of the present invention for different numbers of features in the feature subset.

Detailed Description

In order to more clearly illustrate the embodiments of the present invention, the following description will explain the embodiments of the present invention with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.

The invention provides an unbalanced malicious software detection method based on grouping integration for solving the problem of low detection and classification precision of malicious software caused by data imbalance, and verifies the effectiveness of the method in solving the unbalanced problem on a real data set.

Specifically, as shown in FIG. 1, the unbalanced malware detection method based on packet integration according to the embodiment of the present invention is implemented by a feature extraction module, a feature optimization module and a classification detection module. The feature extraction module uses a Python program to extract permission and API call information from the experimental samples and form a feature vector set. The feature optimization module addresses feature redundancy: it removes redundant features with an information gain algorithm, which prevents overfitting and improves detection efficiency. The classification detection module provides a grouping integration detection algorithm for detecting unbalanced data sets, to counter the bias of traditional classifiers toward the majority class.

As shown in fig. 2, the packet integration-based unbalanced malware detection method according to the embodiment of the present invention specifically includes:

S1, feature extraction: extracting permission information and API call information from experimental samples to form a feature vector set; the experimental samples comprise normal samples and malicious samples, and the number of normal samples is larger than the number of malicious samples.

The method specifically comprises the following steps:

Step 1: write a Python program to read the permission and API call information in the samples and form a feature set; Android custom permissions belong only to specific samples and are not counted.

Step 2: de-duplicate the features extracted in step 1, removing repeated features to form a new feature set FS.

Step 3: for every sample, judge whether the sample contains each element of the set FS; if the sample contains the corresponding feature in FS, the corresponding element of the feature vector is 1; otherwise it is 0. All samples are traversed to form a feature vector set FVS. In addition, a flag bit is added at the end of each feature vector, where 0 represents a normal sample and 1 represents a malicious sample.
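Steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_feature_vectors` and the example feature names are hypothetical, and the features of each sample are assumed to have already been read out in step 1 (the extraction from the application package itself is not shown).

```python
def build_feature_vectors(samples):
    """samples: list of (features, is_malicious) pairs, where `features` is the
    list of permission / API-call names already extracted from one sample."""
    # Step 2: de-duplicate all extracted features into the feature set FS.
    fs = sorted({name for features, _ in samples for name in features})

    # Step 3: one 0/1 element per entry of FS, plus a trailing flag bit
    # (0 = normal sample, 1 = malicious sample).
    fvs = []
    for features, is_malicious in samples:
        present = set(features)
        vector = [1 if name in present else 0 for name in fs]
        vector.append(1 if is_malicious else 0)
        fvs.append(vector)
    return fs, fvs


# Illustrative input only; these two samples are made up for the example.
samples = [
    (["android.permission.SEND_SMS", "android.permission.INTERNET"], True),
    (["android.permission.INTERNET", "getDeviceId"], False),
]
fs, fvs = build_feature_vectors(samples)
```

Sorting FS is a choice made here so that the column order is reproducible; the source does not prescribe an order.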

S2, feature optimization: screening the feature vector set with an information gain algorithm to remove redundant features and obtain an unbalanced data set.

Android software involves a large number of permission features and API call features, so the feature space has high dimensionality and is prone to the curse of dimensionality. To reduce the influence of redundant features on detection and to improve detection efficiency, the embodiment of the invention uses the information gain (IG) algorithm to select, from the original feature space, the features with strong class-discriminating ability.

(1) Information gain based feature correlation analysis

In information theory, information gain measures the difference between two probability distributions. Applied to feature selection, it measures the contribution of a feature to classification: the algorithm obtains the IG value of a feature by calculating the difference between the entropy and the conditional entropy for that feature, and the larger the value, the higher the degree of correlation.

The calculation formula is as follows:

$$IG(t) = -\sum_{i=1}^{m} P(C_i)\log P(C_i) + P(t)\sum_{i=1}^{m} P(C_i \mid t)\log P(C_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(C_i \mid \bar{t})\log P(C_i \mid \bar{t})$$

wherein m represents the total number of classes; $P(t)$ represents the probability that feature t occurs; $P(C_i)$ represents the probability that class $C_i$ occurs; $P(C_i \mid t)$ represents the conditional probability of class $C_i$ given feature t, i.e. the contribution of feature t to class $C_i$; $P(\bar{t})$ represents the probability that the extracted features do not contain feature t; and $P(C_i \mid \bar{t})$ represents the conditional probability that a training sample belongs to class $C_i$ when its extracted features do not include feature t.
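As a minimal sketch of how an IG value could be computed for one binary feature, the function below evaluates the class entropy minus the conditional entropy given the presence or absence of the feature. The function name `information_gain` is hypothetical, and the base-2 logarithm is an assumption, since the source does not state the base.

```python
import math


def information_gain(column, labels):
    """IG of one binary feature. `column` holds the 0/1 value of the feature
    for each sample; `labels` holds the class of each sample."""
    n = len(labels)
    classes = set(labels)

    def plogp_sum(subset):
        # Returns sum_i P(C_i) * log2 P(C_i) over the given labels (0 if empty).
        if not subset:
            return 0.0
        total = 0.0
        for c in classes:
            p = subset.count(c) / len(subset)
            if p > 0:
                total += p * math.log2(p)
        return total

    with_t = [y for f, y in zip(column, labels) if f == 1]
    without_t = [y for f, y in zip(column, labels) if f == 0]
    p_t = len(with_t) / n
    p_not_t = len(without_t) / n
    return (-plogp_sum(list(labels))
            + p_t * plogp_sum(with_t)
            + p_not_t * plogp_sum(without_t))


labels = [1, 1, 0, 0]                               # 1 = malicious, 0 = normal
ig_perfect = information_gain([1, 1, 0, 0], labels)  # feature mirrors the class
ig_useless = information_gain([1, 0, 1, 0], labels)  # feature independent of class
```

A feature that mirrors the class recovers the full class entropy, while a feature that is independent of the class gains nothing, which is why ranking by IG pushes redundant features to the bottom.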

(2) Evaluation and selection of feature subsets

Recall and G-mean are used as the evaluation metrics. Both can be calculated from a confusion matrix, which is shown in Table 1.

TABLE 1 Confusion matrix

	Predicted positive	Predicted negative
Actually positive	TP	FN
Actually negative	FP	TN

Recall and G-mean are calculated by the following formulas:

$$recall = \frac{TP}{TP + FN}$$

$$G\text{-}mean = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}}$$

Recall represents the proportion of malicious samples that are correctly detected. G-mean reflects the overall performance of the classifier on both positive and negative samples, and its value depends on the detection rates of both malicious and normal software. The screening of the feature vector set is therefore adjusted according to the final detection rates of malicious and normal software.
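The two metrics can be checked numerically with a short sketch. It assumes the standard definitions (recall as the true positive rate, G-mean as the geometric mean of the true positive and true negative rates) and uses the counts that Table 3 reports for the algorithm of the invention; the function names are hypothetical.

```python
import math


def recall(tp, fn):
    # Fraction of actually malicious samples that were predicted malicious.
    return tp / (tp + fn)


def g_mean(tp, fn, tn, fp):
    # Geometric mean of the detection rates on malicious and normal samples.
    true_positive_rate = tp / (tp + fn)
    true_negative_rate = tn / (tn + fp)
    return math.sqrt(true_positive_rate * true_negative_rate)


# Counts from the "Algorithm of the invention" column of Table 3.
tp, tn, fp, fn = 1035, 24659, 938, 77
r = recall(tp, fn)
g = g_mean(tp, fn, tn, fp)
print(round(r, 3), round(g, 3))  # 0.931 0.947
```

Because G-mean multiplies the two rates, a classifier that labels everything as normal scores zero even though its plain accuracy on an unbalanced set would look high.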

Most traditional machine learning algorithms assume balanced samples. When they process an unbalanced data set, the classification results are biased toward the majority class and prediction on the minority class is poor, yet the minority-class prediction is the more important one: if a malicious sample is predicted to be normal, the user suffers a loss. Therefore, the invention provides an unbalanced malware detection algorithm based on packet integration for unbalanced data sets, with a decision tree as the base classifier.

(1) Data grouping

The unbalanced data set is divided into a training data set, a verification data set and a test data set, and k balanced data sets are constructed from the training data set by random sampling. Then k decision tree classifiers are trained on the balanced data sets; the classification performance of each decision tree base classifier is tested on the verification set, and the malicious samples it misclassifies are added to the training data of the next decision tree base classifier. Finally, the k decision tree base classifiers are combined by weighted voting into a decision tree integrated classifier C, whose output is taken as the classification result of a test sample during classification detection.

The embodiment of the invention adopts the decision tree as the base classifier because the decision tree is a weak classifier and, as the base classifier of the ensemble, gives better classification performance than strong classifiers.
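The data grouping described above can be sketched as follows. Several points are assumptions rather than statements of the source: the 60/20/20 split is applied to the normal and malicious samples separately, k is taken as the integer part of b/m, and "without replacement" is read as meaning that the k groups of normal samples are disjoint. The function names are hypothetical.

```python
import random


def split_60_20_20(samples, rng):
    """Randomly split one class of samples into training, verification and test parts."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def build_balanced_groups(normal_train, malicious_train, rng):
    """Build k = b // m balanced data sets D_1..D_k from the training data.
    Assumes at least one malicious training sample."""
    b, m = len(normal_train), len(malicious_train)
    k = b // m
    pool = normal_train[:]
    rng.shuffle(pool)
    groups = []
    for i in range(k):
        chunk = pool[i * m:(i + 1) * m]          # m normal samples, not reused
        groups.append(chunk + malicious_train)   # plus all m malicious samples
    return groups


# Synthetic placeholders standing in for feature vectors.
rng = random.Random(0)
normal = [("normal", i) for i in range(100)]
malicious = [("malicious", i) for i in range(17)]
n_train, n_val, n_test = split_60_20_20(normal, rng)
m_train, m_val, m_test = split_60_20_20(malicious, rng)
groups = build_balanced_groups(n_train, m_train, rng)
```

With 60 normal and 10 malicious training samples this yields six groups of 20 samples each, so every base classifier sees a balanced set while the ensemble as a whole still uses all of the normal training data.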

(2) Integrated learning

After the data are grouped and the base classifiers are trained, the base classifiers are integrated by weighted voting.

Specifically, the flow of the packet integration detection algorithm is shown in fig. 2, and the detailed steps are as follows:

Step 1: from the normal data set and the malicious data set (which together form the unbalanced data set), randomly extract 60% of the samples as the training data set, 20% as the verification data set and 20% as the test data set; in the training data set, the number of normal samples and the number of malicious samples are recorded as b and m, respectively.

Step 2: randomly extract, without replacement, m samples from the normal training samples and combine them with the m malicious samples to form a new data set D_i; extract k times in total, where k = b/m, to form k balanced data sets.

Step 3: train a decision tree on each data set D_i, giving k decision tree classifiers in total; test the classification performance of each decision tree classifier t on the verification data set, calculate its recall and record it as r_t. For decision tree classifier t, if a malicious sample is misclassified, add the misclassified sample to the training of the next decision tree classifier. This yields k base classifiers in total, which are combined by weighted voting into an integrated decision tree classifier C.

Step 4: for each test sample x in the test data set, input x into the k base classifiers of the decision tree integrated classifier C and calculate the weighted voting result of the k base classifiers by the following formula:

$$V_c(x) = \sum_{t=1}^{k} r_t \, T_{c,x}(x)$$

where $V_c(x)$ is the weighted vote total for class c, $r_t$ is the recall of decision tree classifier t, and $T_{c,x}(x)$, evaluated on the output of base classifier t, is defined as follows:

$$T_{c,x}(x) = \begin{cases} 1, & \text{if the base classifier judges } x \text{ to be of class } c \\ 0, & \text{otherwise} \end{cases}$$

When counting the votes for the malicious class, class c is the malicious sample and class non-c is the normal sample; when counting the votes for the normal class, class c is the normal sample and class non-c is the malicious sample. That is, when counting the votes for the normal class, $T_{c,x}(x) = 1$ if the base classifier judges sample x to be a normal sample and $T_{c,x}(x) = 0$ if it judges x to be a malicious sample. When counting the votes for the malicious class, $T_{c,x}(x) = 1$ if the base classifier judges sample x to be a malicious sample and $T_{c,x}(x) = 0$ if it judges x to be a normal sample. The class with the most votes is taken as the final class of sample x.
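Steps 3 and 4 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented implementation: `fit` is a hypothetical callable that trains one base classifier (for example a wrapped decision tree) and returns a `predict(x)` function; labels are 1 for malicious and 0 for normal; the verification set is assumed to contain at least one malicious sample; and ties in the vote are resolved toward the normal class, which the source does not specify.

```python
def train_with_feedback(groups, validation, fit):
    """groups: balanced data sets D_1..D_k, each a list of (x, label) pairs.
    validation: list of (x, label) pairs from the verification data set.
    fit: hypothetical trainer, fit(samples) -> predict(x) -> label."""
    classifiers, recalls = [], []
    malicious_val = [(x, y) for x, y in validation if y == 1]
    carried = []  # malicious verification samples the previous classifier missed
    for group in groups:
        predict = fit(group + carried)
        missed = [(x, y) for x, y in malicious_val if predict(x) != 1]
        recalls.append(1 - len(missed) / len(malicious_val))  # r_t
        classifiers.append(predict)
        carried = missed  # fed into the training of the next classifier
    return classifiers, recalls


def weighted_vote(classifiers, recalls, x):
    """Recall-weighted vote: each base classifier adds r_t to the class it predicts."""
    votes = {0: 0.0, 1: 0.0}
    for predict, r_t in zip(classifiers, recalls):
        votes[predict(x)] += r_t  # T = 1 for the predicted class, 0 for the other
    label = 1 if votes[1] > votes[0] else 0  # tie goes to normal (assumption)
    return label, votes


# Toy usage with three stand-in base classifiers (not real decision trees).
toy_classifiers = [lambda x: 1, lambda x: 0, lambda x: 0]
toy_recalls = [0.9, 0.4, 0.3]
label, votes = weighted_vote(toy_classifiers, toy_recalls, x=None)
```

In the toy usage a single high-recall classifier outvotes two low-recall ones, which shows the intended effect of weighting by recall: classifiers that detect malicious samples more reliably have more say in the final class.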

The packet integration-based unbalanced malware detection method of the embodiment of the invention was applied as follows:

The experimental data come from the Drebin website and comprise 123453 normal samples and 5560 malicious samples; each sample has 545 features. The usage-rate rankings of part of the permission information and API call information in the malicious samples and the normal samples are shown in FIGS. 3 and 4.

The IG value of each feature was calculated with the information gain (IG) algorithm; the larger the IG value, the more important the feature. The top-ranked feature attributes are shown in Table 2.

TABLE 2 characteristics and their corresponding IG values

After screening, an optimal feature subset (i.e., the unbalanced data set) is obtained. The packet integration detection algorithm provided by the invention was then simulated with different numbers of features in the subset; the simulation results are shown in FIG. 5.
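Selecting a feature subset of a given size from the IG ranking can be sketched as below. The function names and the IG values in the usage are hypothetical placeholders, not values from Table 2; in the comparison experiment the source keeps the top 70 features.

```python
def select_top_k(ig_values, k):
    """ig_values: dict mapping feature name -> IG value.
    Returns the k feature names with the highest IG, best first."""
    ranked = sorted(ig_values, key=lambda name: ig_values[name], reverse=True)
    return ranked[:k]


def project(vector, feature_names, selected):
    """Keep only the elements of a 0/1 feature vector that belong to `selected`."""
    index = {name: i for i, name in enumerate(feature_names)}
    return [vector[index[name]] for name in selected]


# Placeholder names and IG values, for illustration only.
feature_names = ["A", "B", "C"]
ig_values = {"A": 0.1, "B": 0.5, "C": 0.3}
selected = select_top_k(ig_values, 2)
reduced = project([1, 0, 1], feature_names, selected)
```

Repeating the experiment for several values of k and comparing recall and G-mean is one way to produce a curve like the one the source reports in FIG. 5.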

On the experimental data set, with the top 70 attributes ranked by the IG algorithm as input features, the classification detection strategy of the invention was compared with the kNN, SVM and RF algorithms provided by the sklearn package in Python; the results are shown in Table 3.

TABLE 3 Comparison test data of the packet integration detection algorithm of the present invention and other existing algorithms

Index	kNN	SVM	RF	Algorithm of the invention
TP	877	855	940	1035
TN	25511	25554	25525	24659
FP	86	43	72	938
FN	235	257	172	77
recall	0.788	0.768	0.846	0.931
G-mean	0.866	0.866	0.918	0.947

The foregoing has outlined rather broadly the preferred embodiments and principles of the present invention and it will be appreciated that those skilled in the art may devise variations of the present invention that are within the spirit and scope of the appended claims.

Claims

1. An unbalanced malware detection method based on packet integration is characterized by comprising the following steps:

2. The unbalanced malware detection method based on packet integration as claimed in claim 1, wherein the step S3 specifically comprises the following steps:

wherein r_t is the recall of decision tree classifier t;

T_{c,x}(x) is defined as follows:

3. The unbalanced malware detection method based on packet integration as claimed in claim 2, wherein the step S1 specifically comprises the following steps:

4. The unbalanced malware detection method based on packet integration as claimed in claim 3, wherein the step S1 further comprises:

5. The method according to claim 2, wherein the proportion of the training data set in the unbalanced data set is greater than 50%.

6. The method according to claim 2, wherein the proportion of the training data set in the unbalanced data set is 60%, the proportion of the verification data set in the unbalanced data set is 20%, and the proportion of the test data set in the unbalanced data set is 20%.

7. The method as claimed in claim 1, wherein in step S2, the information gain algorithm calculates the difference between the entropy and the conditional entropy of the features to obtain the IG value of the features, and the larger the IG value is, the more important the features are.

8. The unbalanced malware detection method based on packet integration as claimed in claim 1, wherein in the step S2, two indexes, recall and G-mean, are used as metrics for screening, specifically as follows: