CN110135155B

CN110135155B - Windows malware identification method based on fuzzy K-nearest neighbor

Info

Publication number: CN110135155B
Application number: CN201910260519.6A
Authority: CN
Inventors: 钱权; 唐明东
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2019-04-02
Filing date: 2019-04-02
Publication date: 2023-02-10
Anticipated expiration: 2039-04-02
Also published as: CN110135155A

Abstract

The invention discloses a Windows malware identification method based on fuzzy K-nearest neighbor. A sufficient number of known malware and benign software samples are collected to form a sample library. PE structure information of every sample in the library is extracted using disassembly. The fuzzy interval and membership degree of each sample are then calculated using fuzzy set theory to obtain the sample's fuzzy feature vector. The fuzzy feature vector of an input sample to be detected is obtained in the same way. Following the maximum fuzzy interval matching principle, the set of library samples with the highest fuzzy interval matching degree is found, and the K samples in that set with the smallest Euclidean distance to the sample to be detected are selected. These K samples are sorted by distance in ascending order, and the reciprocal of each sample's sorting subscript is used as its voting weight. The voting weights are summed per category, and the category with the largest sum is used as the predicted label. The method is simple and efficient, and it improves detection efficiency while maintaining accuracy.

Description

Windows malware identification method based on fuzzy K-nearest neighbor

Technical Field

The invention relates to the fields of information security, reverse engineering, and machine learning, and in particular to a Windows malware identification method based on fuzzy K-nearest neighbor.

Background

Malware is any software purposely designed to cause damage to a computer, server, or computer network, such as computer viruses, worms, and trojans. With the widespread use of automatic malware generation and obfuscation techniques, the volume of malicious code has grown explosively and analysis has become much harder. Traditional detection based on label matching can no longer meet current security requirements.

At present, malware analysis methods are mainly divided into static detection and dynamic detection. 1) Static detection analyzes machine code and assembly instructions, for example malware opcode sequences, control flow graphs, and PE structure information. Because static analysis works on the binary code itself, it has access to rich original information and achieves high code coverage. Its drawback is that obfuscation and polymorphic techniques can bypass conventional static detection. 2) Dynamic analysis examines a program by monitoring its behavior during execution, such as file reads and writes, registry operations, network activity, and API calls, and it can cope with the interference caused by obfuscation. Dynamic analysis also has drawbacks. Malware samples must be executed in a secure, controlled environment so that their behavior can be monitored, but that environment differs from the real operating environment, so the malware may behave differently and the recorded behavior log may be inaccurate. In addition, each dynamic analysis run exercises only one execution path, so ensuring maximum coverage of execution paths is a further challenge.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides a Windows malware identification method based on fuzzy K-nearest neighbor, which is an identification method based on static analysis.

To achieve this purpose, the concept of the invention is as follows:

First, a sufficient number of known malware and benign software samples are collected to form a sample library. PE structure information of every sample in the library is then extracted using disassembly. Next, the fuzzy interval and membership degree of each sample are calculated using fuzzy set theory to obtain the sample's fuzzy feature vector. The fuzzy feature vector of an input sample to be detected is obtained in the same way. Following the maximum fuzzy interval matching principle, the set of library samples with the highest fuzzy interval matching degree is found, and the K samples in that set with the smallest Euclidean distance to the sample to be detected are selected. These K samples are sorted by distance in ascending order, and the reciprocal of each sample's sorting subscript is used as its voting weight. The voting weights are summed per category, and the category with the largest sum is used as the predicted label, thereby achieving effective identification of Windows malware.

Based on this concept, the invention adopts the following technical solution:

A Windows malware identification method based on fuzzy K-nearest neighbor comprises the following operation steps:

Step one, collecting a sufficiently large malware set M and a sufficiently large benign software set B to form a known sample library U, namely U = {M, B};

Step two, extracting PE structure information of all samples in the sample library U;

Step three, calculating the fuzzy interval and the membership degree of each dimension using fuzzy set theory, and thereby obtaining the fuzzy feature vector of each sample in the sample library U;

Step four, inputting a sample to be detected, extracting its PE structure information, and calculating its fuzzy feature vector according to the method of step three;

Step five, finding, through maximum fuzzy interval matching, the L samples in the sample library U that best match the fuzzy intervals of the sample to be detected;

Step six, calculating the Euclidean distances between the L samples found in step five and the sample to be detected, and finding the K samples closest to the sample to be detected, wherein L is greater than or equal to K;

Step seven, sorting the K nearest neighbor samples by their Euclidean distance to the sample to be detected in ascending order, and using the reciprocal of each sample's sorting subscript as its voting weight;

Step eight, summing the voting weights of each category, and using the category with the largest sum of voting weights as the predicted label.
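Steps seven and eight can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the function name `rank_weighted_vote` and the `(distance, label)` input format are hypothetical, the sorting subscript is assumed to start at 1, and ties are broken in favor of the label seen first, since the text does not specify a tie-breaking rule.

```python
from collections import defaultdict


def rank_weighted_vote(neighbors):
    """Predict a label from the K nearest neighbors.

    neighbors: list of (distance, label) pairs for the K nearest samples.
    Returns (predicted_label, per-label weight totals).
    """
    # Step seven: sort by distance, smallest first.
    ranked = sorted(neighbors, key=lambda pair: pair[0])
    totals = defaultdict(float)
    # The j-th neighbor (1-based, an assumption) votes with weight 1/j.
    for rank, (_, label) in enumerate(ranked, start=1):
        totals[label] += 1.0 / rank
    # Step eight: the label with the largest summed weight wins.
    predicted = max(totals, key=totals.get)
    return predicted, dict(totals)


if __name__ == "__main__":
    sample_neighbors = [(0.2, "malware"), (0.4, "benign"), (0.3, "benign")]
    print(rank_weighted_vote(sample_neighbors))
```

In this toy input the single closest neighbor outweighs the two farther ones (1 versus 1/2 + 1/3), so the nearest label is returned even though it is in the minority.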

Further, step three comprises the following operation steps:

Step A, determining the value range of each feature dimension in the sample library U to obtain the minimum value and the maximum value of each feature dimension;

Step B, dividing the value range of each feature dimension into N fuzzy intervals using fuzzy set theory;

Step C, calculating, sample by sample, the fuzzy interval and the membership degree of each dimension to obtain the fuzzy feature vector of each sample.
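The three sub-steps above might look as follows in Python. The text does not state how the intervals are laid out or which membership function is used, so this sketch assumes equal-width intervals and a triangular membership that is 1 at the interval center and 0.5 at its edges. The names `fit_ranges` and `fuzzify` are hypothetical, and clamping out-of-range values to the first or last interval is also an assumption.

```python
def fit_ranges(library):
    """Step A: per-dimension (min, max) over all samples in the library."""
    columns = list(zip(*library))
    return [(min(col), max(col)) for col in columns]


def fuzzify(sample, ranges, n_intervals):
    """Steps B and C: map each feature to (interval index, membership degree)."""
    fuzzy_vector = []
    for value, (lo, hi) in zip(sample, ranges):
        if hi == lo:
            # Constant dimension: a single degenerate interval.
            fuzzy_vector.append((0, 1.0))
            continue
        # Step B (assumed): N equal-width intervals over [lo, hi].
        width = (hi - lo) / n_intervals
        # Clamp so values outside the library range still get an interval.
        index = min(max(int((value - lo) // width), 0), n_intervals - 1)
        center = lo + (index + 0.5) * width
        # Step C (assumed): triangular membership around the interval center.
        membership = max(0.0, 1.0 - abs(value - center) / width)
        fuzzy_vector.append((index, membership))
    return fuzzy_vector


if __name__ == "__main__":
    library = [[0.0, 10.0], [4.0, 30.0], [8.0, 20.0]]
    ranges = fit_ranges(library)
    print(fuzzify([1.0, 12.5], ranges, 4))
```

Fitting the ranges once on the library and reusing them for the sample to be detected keeps both fuzzy feature vectors on the same interval grid, which the later interval matching relies on.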

Further, step five comprises the following operation steps:

Step A, initializing a parameter N to the number of feature dimensions, initializing a parameter K, wherein K is greater than or equal to 1 and less than or equal to the size of the sample library U, and initializing a fuzzy interval matching set S;

Step B, initializing a parameter i = 1;

Step C, calculating the fuzzy interval matching degree between the sample to be detected and the i-th known sample in the sample library U, and if the matching degree is equal to N, adding the i-th known sample to the fuzzy interval matching set S;

Step D, incrementing i; if i is greater than the size of the sample library U, executing step E, otherwise executing step C;

Step E, if the size of the fuzzy interval matching set S is greater than or equal to K, returning the set S; otherwise, decrementing N and returning to step B.
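The loop in steps A to E can be sketched as below. The matching degree is assumed here to be the number of dimensions in which two samples fall into the same fuzzy interval, which is consistent with its initial comparison against the number of feature dimensions but is not defined in this text. The function names are hypothetical, and the matching degrees are computed once up front rather than on every pass, which does not change the result.

```python
def matching_degree(intervals_a, intervals_b):
    """Assumed definition: count of dimensions with the same interval index."""
    return sum(1 for a, b in zip(intervals_a, intervals_b) if a == b)


def max_interval_match(query, library, k):
    """Return the indices of the library samples in the matching set S.

    query, library[i]: sequences of fuzzy-interval indices, one per dimension.
    """
    if not 1 <= k <= len(library):
        raise ValueError("K must satisfy 1 <= K <= size of the sample library")
    degrees = [matching_degree(query, sample) for sample in library]
    matched = []              # step A: matching set S
    threshold = len(query)    # step A: N = number of feature dimensions
    while True:
        # Steps B to D: one pass over the library at the current N.
        matched.extend(i for i, d in enumerate(degrees) if d == threshold)
        # Step E: return S once it holds at least K samples, else relax N.
        if len(matched) >= k:
            return matched
        threshold -= 1


if __name__ == "__main__":
    library = [[1, 2, 3], [1, 2, 0], [0, 0, 0], [1, 0, 3]]
    print(max_interval_match([1, 2, 3], library, k=2))
```

Because S is initialized only in step A, it accumulates across passes, so the returned set contains every sample whose matching degree is at least the final value of N. The loop always terminates: at N = 0 every sample qualifies, and K never exceeds the library size.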

Compared with the prior art, the invention has the following advantages:

Fuzzy interval matching filters out the interference of outlier sample points, and the subscript-reciprocal weighting method assigns a voting weight to each nearest neighbor sample, so the method copes better with imbalanced data and improves the accuracy and robustness of detection. Because it relies on static analysis, the method is also simple and efficient, improving detection efficiency while maintaining accuracy.
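The effect of subscript-reciprocal weighting on an imbalanced neighborhood can be seen in a small worked example. The notation below (w_j, s(c), and the labels M and B for malware and benign) is introduced here for illustration only, and the subscript j is assumed to start at 1.

```latex
% Voting weight of the j-th nearest neighbor and score of category c:
w_j = \frac{1}{j}, \qquad
s(c) = \sum_{j \,:\, y_j = c} \frac{1}{j}, \qquad
\hat{y} = \arg\max_{c} s(c)

% Example with K = 5 and neighbor labels, nearest first, (M, B, B, B, M):
s(M) = \frac{1}{1} + \frac{1}{5} = 1.2, \qquad
s(B) = \frac{1}{2} + \frac{1}{3} + \frac{1}{4} = \frac{13}{12} \approx 1.083
```

A plain majority vote would return B (three of five neighbors), whereas the weighted vote returns M because the closest neighbor carries the largest weight.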

Drawings

FIG. 1 is a flowchart of the Windows malware identification method of the present invention.

FIG. 2 shows the fuzzy interval matching process.

FIG. 3 shows the voting decision process of the classification phase.

Detailed Description

The technical solution of the present invention is further described below with reference to the accompanying drawings and specific embodiments.

As shown in FIG. 1, the Windows malware identification method based on fuzzy K-nearest neighbor in this embodiment includes the following steps:

Step one, a sufficiently large malware set M and a sufficiently large benign software set B are collected to form a known sample library U, i.e., U = {M, B}.

Step two, PE structure information is extracted from all samples in the sample library U, giving 53 feature dimensions in total, as shown in Table 1:

TABLE 1 Extracted PE feature information
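The 53 features of Table 1 are not reproduced in this text, and the extraction tool is not named. As a stand-in, the following Python sketch reads only the seven fields of the COFF file header using the standard `struct` module. The function name `read_coff_header` and the choice of fields are assumptions made for illustration. The field layout follows the published PE/COFF format.

```python
import struct

# Field names of the 20-byte COFF file header, in on-disk order.
COFF_FIELDS = (
    "Machine", "NumberOfSections", "TimeDateStamp", "PointerToSymbolTable",
    "NumberOfSymbols", "SizeOfOptionalHeader", "Characteristics",
)


def read_coff_header(data):
    """Parse the COFF file header from the raw bytes of a PE image."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # The 4-byte field at offset 0x3C (e_lfanew) locates the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")
    # The COFF header follows the 4-byte signature, little-endian.
    values = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return dict(zip(COFF_FIELDS, values))


if __name__ == "__main__":
    # Synthetic 64-byte DOS stub followed by a PE signature and COFF header.
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)
    coff = struct.pack("<HHIIIHH", 0x14C, 3, 0, 0, 0, 0xE0, 0x0102)
    print(read_coff_header(bytes(dos) + b"PE\x00\x00" + coff))
```

Each numeric header field becomes one feature dimension. A full extractor would continue into the optional header and section table in the same way.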

Step three, the fuzzy interval and the membership degree of each dimension are calculated using fuzzy set theory to obtain the fuzzy feature vector of each sample in the sample library U, specifically as follows:

Step four, a sample to be detected is input, its PE structure information is extracted, and its fuzzy feature vector is calculated according to the method of step three.

Step five, through maximum fuzzy interval matching, the L samples that best match the fuzzy intervals of the sample to be detected are found in the sample library U, as shown in FIG. 2. The specific steps are as follows:

Step A, initializing the parameter N = 53, initializing the parameter K (1 ≤ K ≤ the size of the sample library U), and initializing the fuzzy interval matching set S;

Step B, initializing the parameter i = 1;

Step D, incrementing i (i.e., i++); if i is greater than the size of the sample library U, executing step E, otherwise executing step C;

Step E, if the size of the fuzzy interval matching set S is greater than or equal to K, returning the set S; otherwise, decrementing N (i.e., N--) and returning to step B.

Step six, the Euclidean distances between the L samples found in step five and the sample to be detected are calculated, and the K samples closest to the sample to be detected are found, where L ≥ K, as shown in FIG. 3.
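Step six amounts to a top-K selection by Euclidean distance over the L matched samples. A minimal Python sketch follows. The text does not say which numeric vector the distance is computed on, so the inputs are assumed to be plain numeric feature vectors (for example, membership degrees), and the function name `k_nearest` is hypothetical.

```python
import heapq
import math


def k_nearest(query, candidates, k):
    """Return the k closest candidates as (distance, index), nearest first."""
    # math.dist computes the Euclidean distance between two points.
    distances = (
        (math.dist(query, vector), index)
        for index, vector in enumerate(candidates)
    )
    # nsmallest returns its results sorted in ascending order.
    return heapq.nsmallest(k, distances)


if __name__ == "__main__":
    matched = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
    print(k_nearest([0.0, 0.0], matched, k=2))
```

Since the result is already ordered from nearest to farthest, the position of each pair in the returned list can serve directly as the sorting subscript for the subsequent vote.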

Claims

1. A Windows malware identification method based on fuzzy K-nearest neighbor, characterized by comprising the following operation steps:

Step three, calculating the fuzzy interval and the membership degree of each dimension using fuzzy set theory, and thereby obtaining the fuzzy feature vector of each sample in the sample library U;

Step four, inputting a sample to be detected, extracting PE structure information of the sample to be detected, and calculating the fuzzy feature vector of the sample to be detected according to the method of step three;

2. The Windows malware identification method based on fuzzy K-nearest neighbor as claimed in claim 1, wherein step three comprises the following operation steps:

Step A, determining the value range of each feature dimension in the sample library U to obtain the minimum value and the maximum value of each feature dimension;

Step C, calculating, sample by sample, the fuzzy interval and the membership degree of each dimension to obtain the fuzzy feature vector of each sample.

3. The Windows malware identification method based on fuzzy K-nearest neighbor as claimed in claim 1, wherein step five comprises the following operation steps:

Step A, initializing a parameter N to the number of feature dimensions, initializing a parameter K, wherein K is greater than or equal to 1 and less than or equal to the size of the sample library U, and initializing a fuzzy interval matching set S;

Step B, initializing a parameter i = 1;