CN110135155B - Fuzzy K neighbor-based Windows malicious software identification method - Google Patents

Info

Publication number
CN110135155B
Authority
CN
China
Prior art keywords
fuzzy
sample
samples
detected
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910260519.6A
Other languages
Chinese (zh)
Other versions
CN110135155A (en)
Inventor
钱权
唐明东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN201910260519.6A
Publication of CN110135155A
Application granted
Publication of CN110135155B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562: Static detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Virology (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a method for identifying Windows malware based on fuzzy K-nearest neighbors. A sufficient number of known malware and benign software samples are collected to form a sample library. The PE structure information of every sample in the library is extracted using disassembly techniques. Fuzzy set theory is then used to compute the fuzzy interval and membership degree of each sample, yielding the sample's fuzzy feature vector. The fuzzy feature vector of an input sample to be detected is obtained in the same way. Following the maximum fuzzy interval matching principle, the set of library samples with the highest fuzzy interval matching degree is found, and from that set the K samples with the smallest Euclidean distance to the sample to be detected are selected. These K samples are sorted by distance in ascending order, and the reciprocal of each sample's rank index is used as its voting weight. The voting weights are summed per category, and the category with the largest sum is used as the predicted label. The method is simple and efficient, and it improves detection efficiency while maintaining accuracy.

Description

Fuzzy K neighbor-based Windows malicious software identification method
Technical Field
The invention relates to the fields of information security, reverse engineering, and machine learning, and in particular to a Windows malware identification method based on fuzzy K-nearest neighbors.
Background
Malware is any software deliberately designed to cause damage to a computer, server, or computer network, such as computer viruses, worms, and trojans. With the widespread use of automatic malware generation and obfuscation techniques, the volume of malicious code has grown explosively and the difficulty of analysis has risen sharply. Traditional detection based on signature matching can no longer meet the security requirements of this new situation.
At present, malware analysis methods fall mainly into static detection and dynamic detection. 1) Static detection analyzes machine code and assembly instructions, for example malware opcode sequences, control flow graphs, and PE structure information. Because static analysis works directly on the binary code, it can obtain rich original information and achieve high code coverage. Its drawback is that techniques such as obfuscation and polymorphism can bypass conventional static detection. 2) Dynamic analysis examines a program by monitoring its behavior during execution, such as file reads and writes, registry operations, network activity, and API calls. Dynamic analysis can cope with obfuscation, but it has drawbacks of its own. Malware samples must be executed in a secure, controlled environment so that their behavior can be monitored, and because that environment differs from a real operating environment, malware may behave differently in it, which makes the recorded behavior logs inaccurate. In addition, each dynamic analysis run exercises only one execution path, so ensuring maximum coverage of the execution paths is a further challenge.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a Windows malware identification method based on fuzzy K-nearest neighbors, which is an identification method based on static analysis.
To achieve this purpose, the conception of the invention is as follows:
First, a sufficient number of known malware and benign software samples are collected to form a sample library. The PE structure information of every sample in the library is then extracted using disassembly techniques. Next, fuzzy set theory is used to compute the fuzzy interval and membership degree of each sample, yielding the sample's fuzzy feature vector. The fuzzy feature vector of an input sample to be detected is obtained in the same way. Following the maximum fuzzy interval matching principle, the set of library samples with the highest fuzzy interval matching degree is found, and from that set the K samples with the smallest Euclidean distance to the sample to be detected are selected. These K samples are sorted by distance in ascending order, and the reciprocal of each sample's rank index is used as its voting weight. The voting weights are summed per category, and the category with the largest sum is used as the predicted label. In this way, Windows malware is identified effectively.
According to this conception, the invention adopts the following technical scheme:
A Windows malware identification method based on fuzzy K-nearest neighbors comprises the following operation steps:
Step one, collecting a sufficient number of malware samples as a set M and a sufficient number of benign software samples as a set B to form a known sample library U, namely U = {M, B};
Step two, extracting the PE structure information of all samples in the sample library U;
Step three, calculating the fuzzy interval and the membership degree of each dimension by using fuzzy set theory, and finally obtaining the fuzzy feature vector of each sample in the sample library U;
Step four, inputting a sample to be detected, extracting the PE structure information of the sample to be detected, and calculating its fuzzy feature vector according to the method of step three;
Step five, finding, through maximum fuzzy interval matching, the L samples in the sample library U whose fuzzy intervals best match those of the sample to be detected;
Step six, calculating the Euclidean distances between the L samples found in step five and the sample to be detected, and finding the K samples closest to the sample to be detected, wherein L is greater than or equal to K;
Step seven, sorting the K nearest-neighbor samples by their Euclidean distance to the sample to be detected in ascending order, and using the reciprocal of each sample's rank index (subscript) as its voting weight;
Step eight, summing the voting weights of each category, and using the category with the largest sum of voting weights as the predicted label.
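Steps six to eight can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function name `rank_weighted_knn`, the `(feature_vector, label)` tuple layout, and the label strings are assumptions made for the example, and the candidate list stands in for the L samples returned by step five.

```python
import math
from collections import defaultdict


def rank_weighted_knn(candidates, query, k):
    """Predict a label from (feature_vector, label) candidates.

    `candidates` plays the role of the L samples from step five
    (hypothetical data layout, not specified by the source).
    """
    # Step six: order candidates by Euclidean distance and keep the K nearest.
    by_distance = sorted(candidates, key=lambda c: math.dist(c[0], query))
    nearest = by_distance[:k]

    # Step seven: the neighbor at 1-based rank r votes with weight 1/r.
    totals = defaultdict(float)
    for rank, (_, label) in enumerate(nearest, start=1):
        totals[label] += 1.0 / rank

    # Step eight: the category with the largest weight sum is the prediction.
    return max(totals, key=totals.get)


library = [
    ([0.0, 0.0], "malware"),
    ([1.0, 0.0], "benign"),
    ([1.5, 0.0], "benign"),
    ([5.0, 5.0], "malware"),
]
print(rank_weighted_knn(library, [0.1, 0.0], k=3))
```

In the usage example the nearest neighbor is malware (weight 1) and the next two are benign (weights 1/2 and 1/3), so the weighted vote favors malware even though benign neighbors are more numerous.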
Further, the operation steps of step three are as follows:
Step A, collecting the value range of each feature dimension over the sample library U to obtain the minimum and maximum value of each feature dimension;
Step B, dividing the value range of each feature dimension into N fuzzy intervals by using fuzzy set theory;
Step C, calculating, sample by sample, the fuzzy interval and the membership degree of each dimension to obtain the fuzzy feature vector of each sample.
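The three sub-steps can be illustrated with the sketch below. The source does not state how the intervals are laid out or which membership function is used, so two assumptions are made here and should be read as such: the intervals are equal-width, and membership is triangular, peaking at the interval center. The function names are hypothetical.

```python
def fit_ranges(samples):
    """Step A: per-dimension (min, max) over the sample library."""
    columns = list(zip(*samples))
    return [(min(col), max(col)) for col in columns]


def fuzzify_value(x, lo, hi, n_intervals):
    """Steps B and C for one value: return (interval index, membership).

    Assumption: equal-width intervals with a triangular membership that is
    1.0 at the interval center and falls linearly to 0.5 at its edges.
    """
    if hi == lo:
        return 0, 1.0  # constant feature: a single degenerate interval
    width = (hi - lo) / n_intervals
    # Clamp so values outside the library range land in the edge intervals.
    clamped = min(max(x, lo), hi)
    idx = min(int((clamped - lo) / width), n_intervals - 1)
    center = lo + (idx + 0.5) * width
    membership = 1.0 - abs(clamped - center) / width
    return idx, membership


def fuzzify(sample, ranges, n_intervals):
    """Fuzzy feature vector: one (interval, membership) pair per dimension."""
    return [fuzzify_value(x, lo, hi, n_intervals)
            for x, (lo, hi) in zip(sample, ranges)]


samples = [[0, 10], [10, 20], [4, 15]]
ranges = fit_ranges(samples)
print(fuzzify([4, 15], ranges, n_intervals=5))
```

The same `ranges` computed from the library would be reused when fuzzifying the sample to be detected in step four, which is why out-of-range values are clamped rather than rejected.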
Further, the operation steps of step five are as follows:
Step A, initializing the parameter N to the number of feature dimensions, initializing the parameter K, wherein K is greater than or equal to 1 and less than or equal to the size of the sample library U, and initializing the fuzzy interval matching set S;
Step B, initializing the parameter i = 1;
Step C, calculating the fuzzy interval matching degree between the sample to be detected and the i-th known sample in the sample library U, and if the matching degree equals N, adding the i-th known sample to the fuzzy interval matching set S;
Step D, incrementing i; if i is greater than the size of the sample library U, executing step E, otherwise executing step C;
Step E, if the size of the fuzzy interval matching set S is greater than or equal to K, returning the set S; otherwise, decrementing N and returning to step B.
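The loop in steps A to E can be sketched as follows. Two readings of the text are assumed and labeled in the comments: the matching degree is taken to be the number of feature dimensions in which two samples fall in the same fuzzy interval, and S is taken to accumulate across passes because step E returns to step B rather than to step A. Note that N here is the matching-degree threshold, not the number of fuzzy intervals used in step three. The function names are hypothetical.

```python
def matching_degree(intervals_a, intervals_b):
    """Assumed definition: count of dimensions with the same fuzzy interval."""
    return sum(1 for a, b in zip(intervals_a, intervals_b) if a == b)


def max_fuzzy_interval_match(library, query, k):
    """Return indices of library samples selected by steps A to E.

    `library` is a list of interval-index vectors, `query` is the
    interval-index vector of the sample to be detected.
    """
    if not 1 <= k <= len(library):
        raise ValueError("K must satisfy 1 <= K <= size of the sample library")

    degrees = [matching_degree(sample, query) for sample in library]
    n = len(query)      # Step A: N starts at the number of feature dimensions
    matched = []        # Step A: the matching set S
    while True:
        # Steps B to D: scan the library for samples whose degree equals N.
        matched.extend(i for i, d in enumerate(degrees) if d == n)
        # Step E: stop once S holds at least K samples, else relax N.
        if len(matched) >= k:
            return matched
        n -= 1          # S is kept between passes (assumed reading)


library = [[1, 2, 3], [1, 2, 0], [1, 0, 0], [9, 9, 9]]
print(max_fuzzy_interval_match(library, [1, 2, 3], k=2))
```

Because every matching degree lies between 0 and the number of dimensions, the loop is guaranteed to terminate by the time N reaches 0, at which point S contains the whole library.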
Compared with the prior art, the method has the following advantages:
Fuzzy interval matching filters out the interference of outlier sample points, and each nearest-neighbor sample is given a voting weight equal to the reciprocal of its rank index, so that imbalanced data can be handled better and the accuracy and robustness of detection are improved. Because the method uses static analysis, it is also simple and efficient, and it improves detection efficiency while maintaining accuracy.
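A small worked example may help show the effect of reciprocal-rank weighting. Assume, purely for illustration, that K = 3 and that the neighbors sorted by distance are labeled malware, benign, benign:

```latex
w_r = \frac{1}{r}, \quad r = 1, \dots, K
\qquad\Longrightarrow\qquad
W_{\text{malware}} = \frac{1}{1} = 1,
\quad
W_{\text{benign}} = \frac{1}{2} + \frac{1}{3} = \frac{5}{6}
```

An unweighted majority vote would predict benign (two votes to one), whereas the weighted vote predicts malware because the closest neighbor counts most. This is one plausible reading of why the weighting helps when one class is over-represented among the neighbors.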
Drawings
FIG. 1 is a flowchart of the Windows malware identification method of the present invention.
FIG. 2 shows the fuzzy interval matching process.
FIG. 3 shows the voting decision process of the classification phase.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, the fuzzy K-nearest-neighbor-based Windows malware identification method of this embodiment comprises the following steps:
Step one, a sufficient number of malware samples are collected as a set M and a sufficient number of benign software samples as a set B to form a known sample library U, i.e., U = {M, B}.
Step two, the PE structure information of all samples in the sample library U is extracted, 53 dimensions in total, as shown in Table 1:
Table 1. Extracted PE feature information
[Table 1 appears as an image in the original publication and is not reproduced here.]
Step three, the fuzzy interval and the membership degree of each dimension are calculated using fuzzy set theory, finally giving the fuzzy feature vector of each sample in the sample library U. The specific steps are as follows:
Step A, collecting the value range of each feature dimension over the sample library U to obtain the minimum and maximum value of each feature dimension;
Step B, dividing the value range of each feature dimension into N fuzzy intervals by using fuzzy set theory;
Step C, calculating, sample by sample, the fuzzy interval and the membership degree of each dimension to obtain the fuzzy feature vector of each sample.
Step four, a sample to be detected is input, its PE structure information is extracted, and its fuzzy feature vector is calculated according to the method of step three.
Step five, through maximum fuzzy interval matching, the L samples whose fuzzy intervals best match those of the sample to be detected are found in the sample library U, as shown in FIG. 2. The specific steps are as follows:
Step A, initializing the parameter N = 53, initializing the parameter K (K is greater than or equal to 1 and less than or equal to the size of the sample library), and initializing the fuzzy interval matching set S;
Step B, initializing the parameter i = 1;
Step C, calculating the fuzzy interval matching degree between the sample to be detected and the i-th known sample in the sample library U, and if the matching degree equals N, adding the i-th known sample to the fuzzy interval matching set S;
Step D, incrementing i (i.e., i++); if i is greater than the size of the sample library U, executing step E, otherwise executing step C;
Step E, if the size of the fuzzy interval matching set S is greater than or equal to K, returning the set S; otherwise, decrementing N (i.e., N--) and returning to step B.
Step six, the Euclidean distances between the L samples found in step five and the sample to be detected are calculated, and the K samples closest to the sample to be detected are found, wherein L is greater than or equal to K, as shown in FIG. 3.
Step seven, the K nearest-neighbor samples are sorted by their Euclidean distance to the sample to be detected in ascending order, and the reciprocal of each sample's rank index (subscript) is used as its voting weight.
Step eight, the voting weights of each category are summed, and the category with the largest sum of voting weights is used as the predicted label.
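For step two of the embodiment, the sketch below shows one way numeric PE header fields could be read with only the Python standard library. It is an illustration under stated assumptions, not the extraction procedure of the invention: the 53 features of Table 1 are not reproduced in this text, so the sketch reads only the seven standard COFF file header fields, and whether any of them belong to the 53 dimensions is an assumption. The function name `extract_coff_features` is hypothetical.

```python
import struct

# Field names of the standard 20-byte COFF file header in a PE file.
COFF_FIELDS = (
    "Machine",
    "NumberOfSections",
    "TimeDateStamp",
    "PointerToSymbolTable",
    "NumberOfSymbols",
    "SizeOfOptionalHeader",
    "Characteristics",
)


def extract_coff_features(data):
    """Return the COFF file header fields of a PE image as a dict."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    # The COFF header follows the 4-byte signature: 2+2+4+4+4+2+2 bytes.
    values = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return dict(zip(COFF_FIELDS, values))


# Synthetic header bytes for demonstration only (not a runnable executable).
demo = bytearray(0x80)
demo[0:2] = b"MZ"
struct.pack_into("<I", demo, 0x3C, 0x40)
demo[0x40:0x44] = b"PE\x00\x00"
struct.pack_into("<HHIIIHH", demo, 0x44, 0x8664, 3, 0, 0, 0, 240, 0x22)
print(extract_coff_features(bytes(demo)))
```

In practice a file would be read with `open(path, "rb").read()` and the resulting numeric values placed in a fixed order to form the raw feature vector that step three fuzzifies.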

Claims (3)

1. A Windows malware identification method based on fuzzy K-nearest neighbors, characterized by comprising the following operation steps:
Step one, collecting a sufficient number of malware samples as a set M and a sufficient number of benign software samples as a set B to form a known sample library U, namely U = {M, B};
Step two, extracting the PE structure information of all samples in the sample library U;
Step three, calculating the fuzzy interval and the membership degree of each dimension by using fuzzy set theory, and finally obtaining the fuzzy feature vector of each sample in the sample library U;
Step four, inputting a sample to be detected, extracting the PE structure information of the sample to be detected, and calculating its fuzzy feature vector according to the method of step three;
Step five, finding, through maximum fuzzy interval matching, the L samples in the sample library U whose fuzzy intervals best match those of the sample to be detected;
Step six, calculating the Euclidean distances between the L samples found in step five and the sample to be detected, and finding the K samples closest to the sample to be detected, wherein L is greater than or equal to K;
Step seven, sorting the K nearest-neighbor samples by their Euclidean distance to the sample to be detected in ascending order, and using the reciprocal of each sample's rank index (subscript) as its voting weight;
Step eight, summing the voting weights of each category, and using the category with the largest sum of voting weights as the predicted label.
2. The Windows malware identification method based on fuzzy K-nearest neighbors of claim 1, wherein the operation steps of step three are as follows:
Step A, collecting the value range of each feature dimension over the sample library U to obtain the minimum and maximum value of each feature dimension;
Step B, dividing the value range of each feature dimension into N fuzzy intervals by using fuzzy set theory;
Step C, calculating, sample by sample, the fuzzy interval and the membership degree of each dimension to obtain the fuzzy feature vector of each sample.
3. The Windows malware identification method based on fuzzy K-nearest neighbors of claim 1, wherein the operation steps of step five are as follows:
Step A, initializing the parameter N to the number of feature dimensions, initializing the parameter K, wherein K is greater than or equal to 1 and less than or equal to the size of the sample library U, and initializing the fuzzy interval matching set S;
Step B, initializing the parameter i = 1;
Step C, calculating the fuzzy interval matching degree between the sample to be detected and the i-th known sample in the sample library U, and if the matching degree equals N, adding the i-th known sample to the fuzzy interval matching set S;
Step D, incrementing i; if i is greater than the size of the sample library U, executing step E, otherwise executing step C;
Step E, if the size of the fuzzy interval matching set S is greater than or equal to K, returning the set S; otherwise, decrementing N and returning to step B.
CN201910260519.6A 2019-04-02 2019-04-02 Fuzzy K neighbor-based Windows malicious software identification method Active CN110135155B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910260519.6A CN110135155B (en) 2019-04-02 2019-04-02 Fuzzy K neighbor-based Windows malicious software identification method

Publications (2)

Publication Number Publication Date
CN110135155A (en) 2019-08-16
CN110135155B (en) 2023-02-10

Family

ID=67568996

Country Status (1)

Country Link
CN (1) CN110135155B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant