CN110135155B - Fuzzy K neighbor-based Windows malicious software identification method - Google Patents
- Publication number
- CN110135155B
- Authority
- CN
- China
- Prior art keywords
- fuzzy
- sample
- samples
- detected
- calculating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Software Systems (AREA)
- Computer Hardware Design (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Virology (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a Windows malware identification method based on fuzzy K-nearest neighbors. A sufficient number of known malware and benign software samples are collected to form a sample library. PE structure information is extracted from every sample in the library using disassembly techniques. Fuzzy set theory is then used to compute the fuzzy interval and membership degree of each sample, which yields the sample's fuzzy feature vector. The fuzzy feature vector of an input sample to be detected is obtained in the same way. Following the maximum fuzzy interval matching principle, the set of library samples whose fuzzy intervals best match those of the sample to be detected is found, and the K samples in that set with the smallest Euclidean distance to the sample to be detected are selected. These K samples are sorted by distance in ascending order, and the reciprocal of each sample's rank index is used as its voting weight. The voting weights are summed per category, and the category with the largest sum is used as the predicted label. The method is simple and efficient, and improves detection efficiency while maintaining accuracy.
Description
Technical Field
The invention relates to the fields of information security, reverse engineering, and machine learning, and in particular to a Windows malware identification method based on fuzzy K-nearest neighbors.
Background
Malware is any software purposely designed to damage a computer, server, or computer network; examples include computer viruses, worms, and trojans. With the widespread use of automated malware generation and obfuscation techniques, the volume of malicious code has grown explosively and analysis has become much harder. Traditional detection based on signature matching can no longer meet security requirements under these conditions.
At present, malware analysis methods fall into two main classes: static detection and dynamic detection.
1) Static detection analyzes machine code and assembly instructions, for example malware opcode sequences, control flow graphs, and PE structure information. Because static analysis works on the binary code itself, it has access to rich original information and achieves high code coverage. Its drawback is that obfuscation, polymorphism, and similar techniques can bypass conventional static detection.
2) Dynamic analysis examines a program by monitoring its behavior during execution, such as file reads and writes, registry operations, network activity, and API calls. It can cope with the interference of obfuscation techniques. Dynamic analysis has drawbacks as well: to monitor their behavior, malware samples must be executed in a secure, controlled environment. That environment differs from the real operating environment, so malware may exhibit different behavior patterns, which makes the recorded behavior log inaccurate. In addition, each dynamic analysis run exercises only one execution path, so guaranteeing maximum coverage of the execution paths is a further challenge.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a Windows malware identification method based on fuzzy K-nearest neighbors; the method is based on static analysis.
To achieve this aim, the concept of the invention is as follows:
First, a sufficient number of known malware and benign software samples are collected to form a sample library. PE structure information is then extracted from every sample in the library using disassembly techniques. Next, fuzzy set theory is used to compute the fuzzy interval and membership degree of each sample, which yields the sample's fuzzy feature vector. The fuzzy feature vector of an input sample to be detected is obtained in the same way. Following the maximum fuzzy interval matching principle, the set of library samples whose fuzzy intervals best match those of the sample to be detected is found, and the K samples in that set with the smallest Euclidean distance to the sample to be detected are selected. These K samples are sorted by distance in ascending order, and the reciprocal of each sample's rank index is used as its voting weight. The voting weights are summed per category, and the category with the largest sum is used as the predicted label. This achieves effective identification of Windows malware.
Based on this concept, the invention adopts the following technical solution:
A Windows malware identification method based on fuzzy K-nearest neighbors comprises the following operation steps:
Step one, collecting a sufficiently large malware set M and a sufficiently large benign software set B to form a known sample library U, i.e. U = {M, B};
Step two, extracting PE structure information of all samples in the sample library U;
Step three, computing the fuzzy interval and membership degree of each dimension using fuzzy set theory, and finally obtaining the fuzzy feature vector of each sample in the sample library U;
Step four, inputting a sample to be detected, extracting its PE structure information, and computing its fuzzy feature vector by the method of step three;
Step five, finding, through maximum fuzzy interval matching, the L samples in the sample library U whose fuzzy intervals best match those of the sample to be detected;
Step six, computing the Euclidean distance between each of the L samples found in step five and the sample to be detected, and finding the K closest samples, where L ≥ K;
Step seven, sorting the K nearest-neighbor samples by their Euclidean distance to the sample to be detected in ascending order, and using the reciprocal of each sample's rank index as its voting weight;
Step eight, summing the voting weights of each category, and using the category with the largest sum as the predicted label.
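Steps six to eight can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function names, the `(vector, label)` candidate layout, and the choice of a 1-based rank (so the closest neighbor gets weight 1) are assumptions made for the example.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_weighted_vote(candidates, query, k):
    """Classify `query` from a list of (vector, label) candidates.

    Keeps the K candidates closest to the query, ranks them by distance
    (assumed 1-based: rank 1 is the closest), gives each a vote of
    1 / rank, and returns the label with the largest total plus the totals.
    """
    if not 1 <= k <= len(candidates):
        raise ValueError("k must satisfy 1 <= k <= len(candidates)")
    # sorted() is stable, so candidates at equal distance keep input order.
    nearest = sorted(candidates, key=lambda c: euclidean(c[0], query))[:k]
    totals = defaultdict(float)
    for rank, (_, label) in enumerate(nearest, start=1):
        totals[label] += 1.0 / rank
    return max(totals, key=totals.get), dict(totals)


# Hypothetical toy data: distances to the query are 1, 2, 3 and 10.
candidates = [
    ([1.0, 0.0], "benign"),
    ([0.0, 2.0], "malware"),
    ([3.0, 0.0], "malware"),
    ([10.0, 0.0], "benign"),
]
label, totals = rank_weighted_vote(candidates, query=[0.0, 0.0], k=3)
print(label, totals)
```

With K = 3 the single closest benign sample (weight 1) outvotes the two malware samples at ranks 2 and 3 (weights 1/2 and 1/3), which shows how strongly the weighting favors the nearest neighbors.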
Further, step three comprises the following operation steps:
Step A, computing the value range of each feature dimension over the sample library U to obtain the minimum and maximum value of each feature dimension;
Step B, dividing the value range of each feature dimension into N fuzzy intervals using fuzzy set theory;
Step C, computing, sample by sample, the fuzzy interval and membership degree of each dimension to obtain the fuzzy feature vector of each sample.
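The text does not say how the fuzzy intervals are shaped or which membership function is used, so the sketch below fills those gaps with explicit assumptions: equal-width intervals over [min, max], and a triangular membership function that is 1.0 at an interval's center and 0.5 at its edges. The function names and the `(interval index, membership)` pair layout are likewise hypothetical.

```python
def fit_ranges(library):
    """Step A: per-dimension (min, max) over all vectors in the library."""
    return [(min(col), max(col)) for col in zip(*library)]


def fuzzify_value(x, lo, hi, n_intervals):
    """Steps B and C for one value: (interval index, membership degree).

    Assumptions (not from the source): equal-width intervals and a
    triangular membership peaking at the interval center.
    """
    if hi == lo:  # constant feature: treat as one degenerate interval
        return 0, 1.0
    width = (hi - lo) / n_intervals
    # Clamp so samples outside the library's range fall in the edge intervals.
    clamped = min(max(x, lo), hi)
    index = min(int((clamped - lo) / width), n_intervals - 1)
    center = lo + (index + 0.5) * width
    membership = 1.0 - abs(clamped - center) / width
    return index, membership


def fuzzify(vector, ranges, n_intervals):
    """Fuzzy feature vector: one (index, membership) pair per dimension."""
    return [fuzzify_value(x, lo, hi, n_intervals)
            for x, (lo, hi) in zip(vector, ranges)]


# Hypothetical two-feature library with N = 5 intervals per dimension.
library = [[0.0, 10.0], [10.0, 30.0]]
ranges = fit_ranges(library)
fuzzy_vector = fuzzify([5.0, 10.0], ranges, n_intervals=5)
print(fuzzy_vector)
```

The same `ranges` computed from the library are reused for the sample to be detected, which is what lets its interval indices be compared against those of the library samples.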
Further, step five comprises the following operation steps:
Step A, initializing the parameter N = the number of feature dimensions, initializing the parameter K, where 1 ≤ K ≤ the size of the sample library U, and initializing the fuzzy interval matching set S;
Step B, initializing the parameter i = 1;
Step C, computing the fuzzy interval matching degree between the sample to be detected and the i-th known sample in the sample library U, and if the matching degree equals N, adding the i-th known sample to the fuzzy interval matching set S;
Step D, incrementing i; if i is larger than the size of the sample library U, executing step E, otherwise executing step C;
Step E, if the size of the fuzzy interval matching set S is greater than or equal to K, returning the set S; otherwise, decrementing N and returning to step B.
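The loop in steps A to E can be sketched as follows. The sketch assumes that the matching degree is the number of dimensions in which two samples fall into the same fuzzy interval, and that a fuzzy feature vector is a list of `(interval index, membership)` pairs; neither detail is spelled out in the text, and the function names are hypothetical. Because S is only initialized in step A, it accumulates across passes, so after relaxing N the set holds every sample whose matching degree is at least N.

```python
def matching_degree(fuzzy_a, fuzzy_b):
    """Number of dimensions whose fuzzy interval index coincides (assumed)."""
    return sum(1 for (ia, _), (ib, _) in zip(fuzzy_a, fuzzy_b) if ia == ib)


def max_interval_match(library_fuzzy, query_fuzzy, k):
    """Return 0-based indices of the library samples in the matching set S."""
    if not 1 <= k <= len(library_fuzzy):
        raise ValueError("k must satisfy 1 <= k <= size of the library")
    # Matching degrees do not change between passes, so compute them once.
    degrees = [matching_degree(f, query_fuzzy) for f in library_fuzzy]
    n = len(query_fuzzy)  # step A: N = number of feature dimensions
    matched = []          # step A: empty matching set S
    while True:
        # Steps B to D: scan the library for samples whose degree equals N.
        matched.extend(i for i, d in enumerate(degrees) if d == n)
        if len(matched) >= k:  # step E: enough candidates, return S
            return matched
        n -= 1                 # step E: relax the requirement and rescan


# Hypothetical three-dimensional fuzzy vectors.
query_fuzzy = [(0, 0.9), (1, 0.8), (2, 0.7)]
library_fuzzy = [
    [(0, 1.0), (1, 0.6), (2, 0.5)],  # matches all 3 intervals
    [(0, 0.7), (1, 0.9), (3, 0.8)],  # matches 2 intervals
    [(4, 0.5), (4, 0.5), (4, 0.5)],  # matches none
]
print(max_interval_match(library_fuzzy, query_fuzzy, k=2))
```

The loop always terminates: once N reaches 0 every library sample qualifies, and K is bounded by the library size.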
Compared with the prior art, the method has the following advantages:
Fuzzy interval matching filters out the interference of outlier sample points, and reciprocal-rank weighting assigns a voting weight to each nearest-neighbor sample, so the method copes better with imbalanced data and improves detection accuracy and robustness. Because it relies on static analysis, the method is also simple and efficient, improving detection efficiency while maintaining accuracy.
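A small worked case illustrates the effect of reciprocal-rank weighting on imbalanced neighborhoods. The numbers are hypothetical, and the rank r is assumed to start at 1 for the closest neighbor. Suppose K = 5, the two closest neighbors are malware, and the remaining three are benign:

```latex
w_r = \frac{1}{r}, \qquad
S(c) = \sum_{r \,:\, y_r = c} \frac{1}{r}

S(\text{malware}) = 1 + \frac{1}{2} = 1.5, \qquad
S(\text{benign}) = \frac{1}{3} + \frac{1}{4} + \frac{1}{5} = \frac{47}{60} \approx 0.78
```

An unweighted majority vote would return benign (3 votes to 2), whereas the weighted vote returns malware because the closest neighbors carry the most weight.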
Drawings
FIG. 1 is a flowchart of the Windows malware identification method of the present invention.
FIG. 2 shows the fuzzy interval matching process.
FIG. 3 shows the voting decision process of the classification phase.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, the Windows malware identification method based on fuzzy K-nearest neighbors in this embodiment comprises the following steps:
Step one, collecting a sufficiently large malware set M and a sufficiently large benign software set B to form a known sample library U, i.e. U = {M, B}.
Step two, extracting PE structure information of all samples in the sample library U, 53 dimensions in total, as shown in Table 1:
Table 1. Extracted PE feature information
Step three, computing the fuzzy interval and membership degree of each dimension using fuzzy set theory, and finally obtaining the fuzzy feature vector of each sample in the sample library U, specifically:
Step A, computing the value range of each feature dimension over the sample library U to obtain the minimum and maximum value of each feature dimension;
Step B, dividing the value range of each feature dimension into N fuzzy intervals using fuzzy set theory;
Step C, computing, sample by sample, the fuzzy interval and membership degree of each dimension to obtain the fuzzy feature vector of each sample.
Step four, inputting a sample to be detected, extracting its PE structure information, and computing its fuzzy feature vector by the method of step three.
Step five, finding, through maximum fuzzy interval matching, the L samples in the sample library U whose fuzzy intervals best match those of the sample to be detected, as shown in FIG. 2, specifically:
Step A, initializing the parameter N = 53, initializing the parameter K (1 ≤ K ≤ the size of the sample library), and initializing the fuzzy interval matching set S;
Step B, initializing the parameter i = 1;
Step C, computing the fuzzy interval matching degree between the sample to be detected and the i-th known sample in the sample library U, and if the matching degree equals N, adding the i-th known sample to the fuzzy interval matching set S;
Step D, incrementing i (i.e. i++); if i > the size of the sample library U, executing step E, otherwise executing step C;
Step E, if the size of the fuzzy interval matching set S is greater than or equal to K, returning the set S; otherwise, decrementing N (i.e. N--) and returning to step B.
Step six, computing the Euclidean distance between each of the L samples found in step five and the sample to be detected, and finding the K closest samples, where L ≥ K, as shown in FIG. 3;
Step seven, sorting the K nearest-neighbor samples by their Euclidean distance to the sample to be detected in ascending order, and using the reciprocal of each sample's rank index as its voting weight;
Step eight, summing the voting weights of each category, and using the category with the largest sum as the predicted label.
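The PE structure extraction of steps two and four can be illustrated with the standard library alone. The body of Table 1 is not reproduced in this text, so the sketch below reads only the seven fields of the COFF file header, which sits directly after the `PE\0\0` signature; whether any of them are among the 53 dimensions used by the method is not stated, and the function name is hypothetical.

```python
import struct

COFF_FIELDS = (
    "Machine",
    "NumberOfSections",
    "TimeDateStamp",
    "PointerToSymbolTable",
    "NumberOfSymbols",
    "SizeOfOptionalHeader",
    "Characteristics",
)


def coff_header_features(data: bytes) -> dict:
    """Read the COFF file header fields from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew at offset 0x3C holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The 20-byte COFF header follows the 4-byte signature (little-endian).
    values = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return dict(zip(COFF_FIELDS, values))


# Build a tiny synthetic header so the example runs without a real binary.
blob = bytearray(0x40 + 24)
blob[:2] = b"MZ"
struct.pack_into("<I", blob, 0x3C, 0x40)
blob[0x40:0x44] = b"PE\0\0"
struct.pack_into("<HHIIIHH", blob, 0x44, 0x14C, 3, 0, 0, 0, 0xE0, 0x0102)
features = coff_header_features(bytes(blob))
print(features)
```

Numeric header fields such as these can be placed directly into a feature vector; a fuller extractor would also walk the optional header and section table.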
Claims (3)
1. A Windows malware identification method based on fuzzy K-nearest neighbors, characterized by comprising the following operation steps:
Step one, collecting a sufficiently large malware set M and a sufficiently large benign software set B to form a known sample library U, i.e. U = {M, B};
Step two, extracting PE structure information of all samples in the sample library U;
Step three, computing the fuzzy interval and membership degree of each dimension using fuzzy set theory, and finally obtaining the fuzzy feature vector of each sample in the sample library U;
Step four, inputting a sample to be detected, extracting its PE structure information, and computing its fuzzy feature vector by the method of step three;
Step five, finding, through maximum fuzzy interval matching, the L samples in the sample library U whose fuzzy intervals best match those of the sample to be detected;
Step six, computing the Euclidean distance between each of the L samples found in step five and the sample to be detected, and finding the K closest samples, where L ≥ K;
Step seven, sorting the K nearest-neighbor samples by their Euclidean distance to the sample to be detected in ascending order, and using the reciprocal of each sample's rank index as its voting weight;
Step eight, summing the voting weights of each category, and using the category with the largest sum as the predicted label.
2. The Windows malware identification method based on fuzzy K-nearest neighbors of claim 1, wherein step three comprises the following operation steps:
Step A, computing the value range of each feature dimension over the sample library U to obtain the minimum and maximum value of each feature dimension;
Step B, dividing the value range of each feature dimension into N fuzzy intervals using fuzzy set theory;
Step C, computing, sample by sample, the fuzzy interval and membership degree of each dimension to obtain the fuzzy feature vector of each sample.
3. The Windows malware identification method based on fuzzy K-nearest neighbors of claim 1, wherein step five comprises the following operation steps:
Step A, initializing the parameter N = the number of feature dimensions, initializing the parameter K, where 1 ≤ K ≤ the size of the sample library U, and initializing the fuzzy interval matching set S;
Step B, initializing the parameter i = 1;
Step C, computing the fuzzy interval matching degree between the sample to be detected and the i-th known sample in the sample library U, and if the matching degree equals N, adding the i-th known sample to the fuzzy interval matching set S;
Step D, incrementing i; if i is larger than the size of the sample library U, executing step E, otherwise executing step C;
Step E, if the size of the fuzzy interval matching set S is greater than or equal to K, returning the set S; otherwise, decrementing N and returning to step B.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910260519.6A CN110135155B (en) | 2019-04-02 | 2019-04-02 | Fuzzy K neighbor-based Windows malicious software identification method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110135155A (en) | 2019-08-16 |
CN110135155B (en) | 2023-02-10 |
Family
ID=67568996
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910260519.6A Active CN110135155B (en) | 2019-04-02 | 2019-04-02 | Fuzzy K neighbor-based Windows malicious software identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110135155B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111476297A (en) * | 2020-04-07 | 2020-07-31 | 中国民航信息网络股份有限公司 | Category determination method and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106101252A (en) * | 2016-07-01 | 2016-11-09 | 何钟柱 | Information Security Risk guard system based on big data and trust computing |
CN107273746A (en) * | 2017-05-18 | 2017-10-20 | 广东工业大学 | A kind of mutation malware detection method based on APK character string features |
CN109255363A (en) * | 2018-07-11 | 2019-01-22 | 齐鲁工业大学 | A kind of fuzzy k nearest neighbor classification method and system based on weighted chi-square distance metric |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9652688B2 (en) * | 2014-11-26 | 2017-05-16 | Captricity, Inc. | Analyzing content of digital images |
US11481492B2 (en) * | 2017-07-25 | 2022-10-25 | Trend Micro Incorporated | Method and system for static behavior-predictive malware detection |
Non-Patent Citations (2)
Title |
---|
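Dynamic API call sequence visualisation for malware classification; Mingdong Tang; IET Information Security; 2019-03-06; pp. 368-377 * |
Research on GIS operating state evaluation based on the FKNN algorithm; Fang Qin et al.; Journal of Hubei University of Technology; 2018-04-30; pp. 62-66 * |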
Also Published As
Publication number | Publication date |
---|---|
CN110135155A (en) | 2019-08-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Fan et al. | Malicious sequential pattern mining for automatic malware detection | |
Alasmary et al. | Analyzing and detecting emerging Internet of Things malware: A graph-based approach | |
US10200391B2 (en) | Detection of malware in derived pattern space | |
Tian et al. | An automated classification system based on the strings of trojan and virus families | |
CN107315956B (en) | It is a kind of for quick and precisely detecting the Graph-theoretical Approach of Malware on the zero | |
Parsazad et al. | Fast feature reduction in intrusion detection datasets | |
Morales-Molina et al. | Methodology for malware classification using a random forest classifier | |
Imran et al. | Using hidden markov model for dynamic malware analysis: First impressions | |
Elkhawas et al. | Malware detection using opcode trigram sequence with SVM | |
Jang et al. | Mal-netminer: malware classification based on social network analysis of call graph | |
CN111259397A (en) | Malware classification method based on Markov graph and deep learning | |
Wan et al. | IoT-malware detection based on byte sequences of executable files | |
San et al. | Malicious software family classification using machine learning multi-class classifiers | |
Park et al. | Birds of a feature: Intrafamily clustering for version identification of packed malware | |
Khan et al. | Determining malicious executable distinguishing attributes and low-complexity detection | |
Sivakumar et al. | Malware Detection Using The Machine Learning Based Modified Partial Swarm Optimization Approach | |
Li et al. | MDBA: Detecting malware based on bytes n-gram with association mining | |
CN110135155B (en) | Fuzzy K neighbor-based Windows malicious software identification method | |
CN112257076B (en) | Vulnerability detection method based on random detection algorithm and information aggregation | |
Kim et al. | Malicious behavior detection method using api sequence in binary execution path | |
Pranav et al. | Detection of botnets in IoT networks using graph theory and machine learning | |
Muhaya et al. | Polymorphic malware detection using hierarchical hidden markov model | |
CN115118482B (en) | Industrial control system intrusion detection clue analysis and tracing method, system and terminal | |
CN115545091A (en) | Integrated learner-based malicious program API (application program interface) calling sequence detection method | |
US11977633B2 (en) | Augmented machine learning malware detection based on static and dynamic analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||