CN113468532B - Malicious software family inference method and system - Google Patents

Malicious software family inference method and system

Info

Publication number
CN113468532B
Authority
CN
China
Prior art keywords
family
result
sample
inference
scanning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110820216.2A
Other languages
Chinese (zh)
Other versions
CN113468532A (en)
Inventor
朱宏宇
田建伟
田峥
蒋永康
李生红
杨志邦
黎曦
李琪瑶
张宇翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Hunan Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Hunan Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Hunan Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Hunan Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Hunan Electric Power Co Ltd, Information and Telecommunication Branch of State Grid Hunan Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202110820216.2A priority Critical patent/CN113468532B/en
Publication of CN113468532A publication Critical patent/CN113468532A/en
Application granted granted Critical
Publication of CN113468532B publication Critical patent/CN113468532B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Abstract

The invention discloses a malware family inference method comprising: obtaining executable programs and constructing a sample set; querying the sample set and recording the query results; scanning the sample set with multiple antivirus engines and preprocessing the scan results; inferring the maliciousness of each sample with heuristic rules and outputting the inference result; and performing maximum-likelihood estimation modeling and solving to output the final family inference result. The invention also discloses a system for implementing the malware family inference method. The method addresses the label fluctuation of antivirus engines and their high false-positive rate on packed samples, reduces the influence of family popularity on the inference result, enables automated labeling of large-scale malware family data sets, and offers high accuracy and strong robustness.

Description

Malicious software family inference method and system
Technical Field
The invention belongs to the technical field of computer security, and particularly relates to a method and a system for inferring a malicious software family.
Background
Malware is a long-standing and increasingly serious problem. In recent years, outbreaks such as the Mirai botnet and the WannaCry ransomware have caused great losses to public infrastructure and to users' personal property. The growing complexity and scale of malicious programs have driven the security community to explore better analysis tools and methods.
Classification of malware is a problem that still needs improvement. Malware classification mainly involves two tasks: 1) malware detection, which distinguishes benign software from malicious software; and 2) malware family identification, which distinguishes different types of malware. Accurately detecting the maliciousness of a program can effectively block the spread of malware, and accurately identifying different malware families can effectively reduce manual analysis cost and trace the source of malicious behavior. Currently, antivirus engines mainly rely on feature signatures to accomplish these classification tasks. Feature signatures are essentially string pattern matches specific to already-detected malware; they must be created manually by security analysts and are labor intensive. Consequently, as polymorphic and metamorphic techniques are widely adopted by malware, keeping the signature database up to date becomes difficult.
Machine learning is viewed by the security community as the most promising approach to the malware family classification problem, and researchers are actively exploring machine-learning-based family classification models. As in computer vision and natural language processing, the primary prerequisite for building machine learning models is obtaining more and better labeled data; a good data set is one of the most effective ways to improve the accuracy of a machine learning system.
Manually labeling malware requires a solid professional background, and in the face of a huge number of program samples, ImageNet-style crowdsourced labeling (Jia Deng, 2010) is no longer applicable. At present the security community lacks a widely recognized malware data set, and academia has not studied automatic malware labeling in much depth; the ground truth used to validate most models is derived empirically from antivirus engine scan results. For labeling maliciousness, a commonly adopted strategy is to upload suspicious samples to VirusTotal, have them scanned by the dozens of antivirus engines that VirusTotal integrates, and then decide maliciousness through an N/K threshold strategy (N is the number of engines that flag the sample as malicious, K is the total number of engines). For example, Saxe and Berlin (2015) used the threshold N/K ≥ 0.3, and Incer et al. (2018) used the threshold N ≥ 4. For family labeling, the AVClass tool proposed by Marcos et al. (2016) is widely used; AVClass extracts family tags from many inconsistent antivirus engine results and then identifies the family by plurality vote. Unfortunately, although AVClass can extract family tags, the accuracy of its tags is significantly affected by family popularity.
A 2020 study by Zhu et al. on antivirus engines showed that choosing a reasonable k value can effectively label the maliciousness of samples. The same study also showed that: 1) an engine's maliciousness verdict for a sample fluctuates over time and generally stabilizes only after several months; 2) engines have a high false-positive rate on new samples; and 3) engines have a very high false-positive rate on packed samples.
in conclusion, malware family inference based on antivirus engines remains an open problem.
Disclosure of Invention
The invention aims to provide a malicious software family inference method with high accuracy and strong robustness.
The invention also aims to provide a system for realizing the malware family inference method.
The invention provides a malware family inference method comprising the following steps:
S1, obtaining executable programs and constructing a sample set;
S2, querying each executable program in the sample set and recording the results;
S3, scanning the executable programs in the sample set with multiple antivirus engines and preprocessing the scan results;
S4, inferring the maliciousness of each sample with heuristic rules based on the preprocessing result of step S3, and outputting the inference result;
S5, performing maximum-likelihood estimation modeling and solving based on the preprocessing result of step S3, and outputting and storing the family inference result.
In step S2, querying and recording the executable programs in the sample set specifically means querying whether an inference result already exists for the current program: if so, the inference result is returned directly; if not, the first-seen time and the scanning time of the program are recorded.
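For illustration only, the record keeping of step S2 could be sketched as follows with a simple SQLite table; the schema, the column names and the use of a SHA-256 hash as the sample key are assumptions of this example and are not prescribed by the invention.

```python
import hashlib
import sqlite3
import time

def lookup_or_register(path, conn):
    """Return the cached inference result if the sample is already known;
    otherwise record its first-seen and scan times so it can be scanned and inferred."""
    with open(path, "rb") as fh:
        sha256 = hashlib.sha256(fh.read()).hexdigest()
    row = conn.execute(
        "SELECT maliciousness, family FROM samples WHERE sha256 = ?", (sha256,)
    ).fetchone()
    if row is not None:
        return {"sha256": sha256, "maliciousness": row[0], "family": row[1]}
    now = time.time()
    conn.execute(
        "INSERT INTO samples (sha256, first_seen, scan_time) VALUES (?, ?, ?)",
        (sha256, now, now),   # in practice the first-seen time would come from external metadata
    )
    conn.commit()
    return None   # caller proceeds with scanning, maliciousness inference and family inference

# Example setup with an in-memory database and the assumed schema.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE samples (sha256 TEXT PRIMARY KEY, first_seen REAL,
                scan_time REAL, maliciousness INTEGER, family TEXT)""")
```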
In step S3, scanning the executable programs in the sample set with multiple antivirus engines and preprocessing the scan results specifically means scanning the sample set with existing antivirus engines and processing the scan results as follows: identical-result filtering, result-label tokenization, generic-string and random-string filtering, malware family alias replacement, maliciousness calculation, and family maximum engine agreement calculation.
Identical-result filtering addresses the fact that different antivirus engines produce different label patterns when scanning the same executable program; if several antivirus engines give the same label to the same executable program, one of those labels is selected at random and kept.
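As an illustration of the identical-result filtering, result-label tokenization and generic/random-string filtering named above, a minimal Python sketch might look like the following; the token list, the randomness heuristic, the engine names and all function names are assumptions of this example rather than part of the invention.

```python
import re

# Hypothetical list of generic tokens that carry no family information.
GENERIC_TOKENS = {"trojan", "virus", "worm", "malware", "generic", "agent",
                  "win32", "w32", "heur", "variant", "application"}

def filter_identical_results(scan_results):
    """Identical-result filtering: keep only one engine label per distinct label string."""
    seen = {}
    for engine, label in scan_results.items():
        if label and label.lower() not in seen:
            seen[label.lower()] = (engine, label)
    return dict(seen.values())

def looks_random(token, digit_ratio=0.4):
    """Crude heuristic for random strings: longer tokens dominated by digits."""
    if len(token) < 6:
        return False
    digits = sum(c.isdigit() for c in token)
    return digits / len(token) >= digit_ratio

def tokenize_label(label):
    """Result-label tokenization plus generic-string and random-string filtering."""
    tokens = re.split(r"[^0-9a-zA-Z]+", label.lower())
    return [t for t in tokens
            if t and t not in GENERIC_TOKENS and not looks_random(t)]

# Example with made-up engine names and labels.
results = {"EngineA": "Trojan.Win32.Zbot.abcd1234",
           "EngineB": "W32/Zbot.variant",
           "EngineC": "Trojan.Win32.Zbot.abcd1234"}
deduped = filter_identical_results(results)
print({engine: tokenize_label(label) for engine, label in deduped.items()})
# -> {'EngineA': ['zbot'], 'EngineB': ['zbot']}
```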
Malware family alias replacement uses a malware family alias database to substitute aliases so that family names are unified. The alias database is learned with the following strategy: let A and B be two family names and A → B the rule mapping family name A to B; define the support support(A → B) = P(AB) as the probability that A and B occur together, and the confidence confidence(A → B) = P(B|A) = P(AB)/P(A) as the probability that B occurs when A occurs. If support(A → B) is above a set threshold, confidence(A → B) is above a set threshold, and A → B is manually judged to be a genuine alias relation, B is taken to be an alias of A; otherwise B is treated as a different name from A.
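The alias-learning strategy above can be illustrated with the following sketch, which counts co-occurring family names across samples and applies the support and confidence thresholds; the threshold values, the toy data and the final manual confirmation step are placeholders of this example, not values fixed by the invention.

```python
from collections import Counter
from itertools import combinations

def learn_alias_candidates(samples_tokens, min_support=0.05, min_confidence=0.9):
    """samples_tokens: list of sets, each holding the family names reported for one sample.
    Returns candidate rules A -> B (B suspected to be an alias of A) whose support and
    confidence exceed the thresholds; each candidate still requires manual review."""
    n = len(samples_tokens)
    single = Counter()          # counts for P(A)
    pair = Counter()            # counts for P(AB)
    for tokens in samples_tokens:
        single.update(tokens)
        pair.update(combinations(sorted(tokens), 2))

    candidates = []
    for (a, b), ab_count in pair.items():
        support = ab_count / n                      # support(A -> B) = P(AB)
        for x, y in ((a, b), (b, a)):               # test both rule directions
            confidence = ab_count / single[x]       # confidence(X -> Y) = P(XY) / P(X)
            if support >= min_support and confidence >= min_confidence:
                candidates.append((x, y, support, confidence))
    return candidates

# Example: three samples whose engines reported these family names.
samples = [{"zbot", "zeus"}, {"zbot", "zeus"}, {"zbot"}]
for a, b, s, c in learn_alias_candidates(samples, 0.3, 0.9):
    print(f"candidate alias rule {a} -> {b}: support={s:.2f}, confidence={c:.2f}")
```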
The maliciousness calculation computes the maliciousness E_1 as
E_1 = n / K
where n is the number of antivirus engines that identify the sample as malicious after result filtering and K is the total number of antivirus engines after result filtering.
The family maximum engine agreement calculation computes the family maximum engine agreement E_2 as
E_2 = max_j ( Σ_i f_{ij} )
where the sum runs over the antivirus engines i after result filtering, and f_{ij} indicates whether antivirus engine i gives family tag j: f_{ij} = 1 if engine i gives family tag j, and f_{ij} = 0 otherwise.
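For illustration, the two scores E_1 and E_2 can be computed directly from the filtered scan results, as in the following sketch; the engine names and the representation of a scan result as an engine-to-label mapping are assumptions of this example.

```python
from collections import Counter

def maliciousness(scan_results):
    """E1 = n / K: fraction of (filtered) engines that flag the sample as malicious.
    scan_results maps engine name -> family label, or None if the engine reports clean."""
    K = len(scan_results)
    n = sum(1 for label in scan_results.values() if label is not None)
    return n / K if K else 0.0

def family_max_agreement(scan_results):
    """E2 = max_j sum_i f_ij: the largest number of engines agreeing on one family tag."""
    votes = Counter(label for label in scan_results.values() if label is not None)
    return max(votes.values()) if votes else 0

# Example with hypothetical engines: 3 of 4 flag the sample, 2 agree on "zbot".
results = {"EngineA": "zbot", "EngineB": "zbot", "EngineC": "gamarue", "EngineD": None}
print(maliciousness(results))        # E1 = 3/4 = 0.75
print(family_max_agreement(results)) # E2 = 2
```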
In step S4, inferring the maliciousness of the sample with heuristic rules based on the preprocessing result of step S3 and outputting the inference result specifically includes the following steps, each of which outputs a Boolean result (True/False):
A. judge whether the difference between the scanning time and the first-seen time of the sample is greater than a set threshold, and output the judgment result r_a;
B. judge whether the sample is packed, and output the judgment result r_b;
C. judge whether the maliciousness of the sample is greater than a set threshold, and output the judgment result r_c;
D. judge whether the family maximum engine agreement of the sample is greater than a set threshold, and output the judgment result r_d;
E. from the judgment results of steps A to D, compute the maliciousness result r of the sample as
r = r_a & r_b & r_c & r_d
where & is the logical AND operation.
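Combining steps A to E, a minimal sketch of the heuristic maliciousness decision might look like the following; the threshold values, the packing check and the field names are illustrative assumptions, and the 0/1 convention for each judgment follows the examples given in the detailed description below.

```python
from dataclasses import dataclass

@dataclass
class SampleInfo:
    scan_time: float        # Unix timestamp of the scan
    first_seen: float       # Unix timestamp when the sample first appeared
    is_packed: bool         # result of a packer check (e.g. a PE entropy heuristic)
    e1: float               # maliciousness E1 from preprocessing
    e2: int                 # family maximum engine agreement E2 from preprocessing

def infer_maliciousness(s, age_threshold=90 * 86400, e1_threshold=0.3, e2_threshold=4):
    """Combine the four Boolean judgments with logical AND, as in r = r_a & r_b & r_c & r_d.
    Following the example convention of the embodiment, each rule outputs 0 when its
    condition holds and 1 otherwise."""
    r_a = 0 if (s.scan_time - s.first_seen) > age_threshold else 1
    r_b = 0 if s.is_packed else 1
    r_c = 0 if s.e1 > e1_threshold else 1
    r_d = 0 if s.e2 > e2_threshold else 1
    return r_a & r_b & r_c & r_d

sample = SampleInfo(scan_time=1_700_000_000, first_seen=1_695_000_000,
                    is_packed=False, e1=0.1, e2=2)
print(infer_maliciousness(sample))   # 1 under these illustrative thresholds
```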
The step S5 of performing maximum-likelihood estimation modeling and solving based on the preprocessing result obtained in step S3 to output the final family inference result specifically includes the following steps:
Suppose there are K antivirus engines, I samples and J families, that the true family of sample i is Y_i, and that Y_i follows a categorical distribution with probabilities θ = [θ_1, θ_2, ..., θ_J]; the corresponding likelihood function is defined as
h(θ, π) = ∏_{i=1}^{I} ∏_{j=1}^{J} [ θ_j ∏_{k=1}^{K} ∏_{l=1}^{J} (π_{jl}^{(k)})^{x_{il}^{(k)}} ]^{T_{ij}}
where T_{ij} is an indicator variable: if q is the true family of sample i then T_{iq} = 1 and T_{ij} = 0 for j ≠ q; θ_j is the probability that sample i comes from family j; π_{jl}^{(k)} is the probability that antivirus engine k identifies a family-j sample as family l; x_{il}^{(k)} indicates whether engine k labels sample i as family l; and both θ_j and π_{jl}^{(k)} are unknown. An expectation-maximization algorithm is used to find the θ_j and π_{jl}^{(k)} that maximize the function h, from which the true family label Ŷ_i of each sample is estimated.
The expectation-maximization algorithm is an iterative optimization process: θ_j and π_{jl}^{(k)} are first estimated from an initial estimate of the indicator T_{ij}, and then h, θ_j and π_{jl}^{(k)} are recomputed in a loop until the function h converges, at which point the algorithm ends and the true family labels are output.
The invention also provides a system for implementing the malware family inference method, comprising a data acquisition module, a database module, a scanning module, a maliciousness inference module and a family inference module connected in series in that order. The data acquisition module obtains executable programs and constructs the sample set; the database module queries each executable program in the sample set and records the results; the scanning module scans the executable programs in the sample set with multiple antivirus engines and preprocesses the scan results; the maliciousness inference module infers the maliciousness of each sample with heuristic rules based on the obtained preprocessing result and outputs the inference result; and the family inference module performs maximum-likelihood estimation modeling and solving based on the obtained preprocessing result to output the final family inference result.
The malware family inference method and system of the invention innovatively model malware family inference as a two-stage task comprising heuristic-rule-based maliciousness inference and expectation-maximization-based family inference. The former effectively handles the label fluctuation of antivirus engines and their high false-positive rate on packed samples; the maximum-likelihood modeling and expectation-maximization method reduce the influence of family popularity on the inference result as much as possible and improve its accuracy. The method can therefore effectively automate the labeling of large-scale malware family data sets, with high accuracy and strong robustness.
Drawings
FIG. 1 is a schematic flow diagram of the method of the present invention.
FIG. 2 is a flow chart of the database query for an executable program in an embodiment of the method of the present invention.
FIG. 3 is a flowchart of scan result preprocessing in an embodiment of the present invention.
FIG. 4 is a flowchart of maliciousness inference in an embodiment of the present invention.
FIG. 5 is a schematic diagram of the probability transitions in malware family inference in an embodiment of the present invention.
FIG. 6 is a functional block diagram of the system of the present invention.
Detailed Description
FIG. 1 is a schematic flow chart of the method of the present invention: the invention provides a malware family inference method, which comprises the following steps:
S1, obtaining executable programs and constructing a sample set;
S2, querying each executable program in the sample set and recording the results; specifically, query whether an inference result already exists for the current program: if so, return the inference result directly; if not, record the first-seen time and the scanning time of the program;
S3, scanning the executable programs in the sample set with multiple antivirus engines and preprocessing the scan results; specifically, the executable programs in the sample set are scanned with existing antivirus engines and the scan results are processed as follows: identical-result filtering, result-label tokenization, generic-string and random-string filtering, malware family alias replacement, maliciousness calculation, and family maximum engine agreement calculation;
In a specific implementation, identical-result filtering addresses the fact that different antivirus engines produce different label patterns when scanning the same executable program; if several antivirus engines give the same label to the same executable program, one of those labels is selected at random and kept;
Malware family alias replacement uses a malware family alias database to substitute aliases so that family names are unified; the alias database is learned with the following strategy: let A and B be two family names and A → B the rule mapping family name A to B; define the support support(A → B) = P(AB) as the probability that A and B occur together, and the confidence confidence(A → B) = P(B|A) = P(AB)/P(A) as the probability that B occurs when A occurs; if support(A → B) is above a set threshold, confidence(A → B) is above a set threshold, and A → B is manually judged to be a genuine alias relation, B is taken to be an alias of A; otherwise B is treated as a different name from A;
The maliciousness calculation computes the maliciousness E_1 as
E_1 = n / K
where n is the number of antivirus engines that identify the sample as malicious after result filtering and K is the total number of antivirus engines after result filtering;
The family maximum engine agreement calculation computes the family maximum engine agreement E_2 as
E_2 = max_j ( Σ_i f_{ij} )
where the sum runs over the antivirus engines i after result filtering, and f_{ij} indicates whether antivirus engine i gives family tag j: f_{ij} = 1 if engine i gives family tag j, and f_{ij} = 0 otherwise;
S4, inferring the maliciousness of the sample with heuristic rules based on the preprocessing result obtained in step S3, and outputting the inference result; this specifically includes the following steps, each of which outputs a Boolean result (True/False):
A. judge whether the difference between the scanning time and the first-seen time of the sample is greater than a set threshold, and output the judgment result r_a; for example, if the difference is greater than the set threshold, r_a = 0, otherwise r_a = 1;
B. judge whether the sample is packed, and output the judgment result r_b; for example, if the sample is packed, r_b = 0, otherwise r_b = 1;
C. judge whether the maliciousness of the sample is greater than a set threshold, and output the judgment result r_c; for example, if the maliciousness is greater than the set threshold, r_c = 0, otherwise r_c = 1;
D. judge whether the family maximum engine agreement of the sample is greater than a set threshold, and output the judgment result r_d; for example, if the family maximum engine agreement is greater than the set threshold, r_d = 0, otherwise r_d = 1;
E. from the judgment results of steps A to D, compute the maliciousness result r of the sample as
r = r_a & r_b & r_c & r_d
where & is the logical AND operation; for example, if any of the judgment results of steps A to D is 0, the final maliciousness result is 0;
S5, performing maximum-likelihood estimation modeling and solving based on the preprocessing result obtained in step S3 to output the final family inference result; this specifically includes the following steps:
Suppose there are K antivirus engines, I samples and J families, that the true family of sample i is Y_i, and that Y_i follows a categorical distribution with probabilities θ = [θ_1, θ_2, ..., θ_J]; the corresponding likelihood function is defined as
h(θ, π) = ∏_{i=1}^{I} ∏_{j=1}^{J} [ θ_j ∏_{k=1}^{K} ∏_{l=1}^{J} (π_{jl}^{(k)})^{x_{il}^{(k)}} ]^{T_{ij}}
where T_{ij} is an indicator variable: if q is the true family of sample i then T_{iq} = 1 and T_{ij} = 0 for j ≠ q; θ_j is the probability that sample i comes from family j; π_{jl}^{(k)} is the probability that antivirus engine k identifies a family-j sample as family l; x_{il}^{(k)} indicates whether engine k labels sample i as family l; and both θ_j and π_{jl}^{(k)} are unknown; an expectation-maximization algorithm is used to find the θ_j and π_{jl}^{(k)} that maximize the function h, from which the true family label Ŷ_i of each sample is estimated;
In a specific implementation, the expectation-maximization algorithm is an iterative optimization process: θ_j and π_{jl}^{(k)} are first estimated from an initial estimate of the indicator T_{ij}, and then h, θ_j and π_{jl}^{(k)} are recomputed in a loop until the function h converges, at which point the algorithm terminates and the true family labels are output;
In a specific implementation, a browser/server architecture is adopted: the modules are deployed on the server side to perform the core functions such as scanning and inference, while the browser side is used to upload malware in bulk and to present the malicious family inference results.
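As an illustration of this browser/server arrangement only, a server-side upload endpoint could be sketched roughly as follows; the use of Flask, the endpoint path and the two placeholder functions are assumptions of this example, not part of the claimed invention.

```python
# Minimal sketch only: Flask, the endpoint path and the placeholder functions below
# are illustrative assumptions, not the system's prescribed implementation.
import hashlib
from flask import Flask, request, jsonify

app = Flask(__name__)

def infer_maliciousness(data: bytes) -> int:
    """Placeholder for the heuristic maliciousness inference module (step S4)."""
    return 1

def infer_family(data: bytes) -> str:
    """Placeholder for the EM-based family inference module (step S5)."""
    return "unknown"

@app.route("/upload", methods=["POST"])
def upload_samples():
    verdicts = []
    for f in request.files.getlist("samples"):      # browser side uploads samples in bulk
        data = f.read()
        sha256 = hashlib.sha256(data).hexdigest()
        malicious = infer_maliciousness(data)
        family = infer_family(data) if malicious else None
        verdicts.append({"sha256": sha256, "malicious": malicious, "family": family})
    return jsonify(verdicts)                        # results are presented on the browser side
```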
The process of the invention is further illustrated below with reference to one example:
The method comprises the following steps:
S1, obtaining executable programs and constructing a sample set;
S2, querying each executable program in the sample set and recording the results; as shown in FIG. 2, this specifically includes:
computing the hash value of the sample and querying it in the database;
if no record exists, scanning the sample, performing maliciousness inference and family inference, and updating the database;
S3, scanning the executable programs in the sample set with multiple antivirus engines and preprocessing the scan results; as shown in FIG. 3, this specifically includes:
calling the antivirus engines to obtain the scan results;
filtering identical results;
tokenizing the result labels;
filtering generic strings and random strings;
replacing malware family aliases;
computing the maliciousness and the family maximum engine agreement;
S4, inferring the maliciousness of the sample with heuristic rules based on the preprocessing result obtained in step S3, and outputting the inference result; as shown in FIG. 4, this specifically includes:
judging the difference between the scanning time and the first-seen time of the sample;
judging whether the sample is packed;
judging the maliciousness of the sample;
judging the family maximum engine agreement;
and S5, based on the preprocessing result obtained in the step S3, performing maximum likelihood estimation modeling and solving, and outputting a final family inference result.
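As an illustration of the expectation-maximization solution of step S5, the following sketch estimates the family priors θ_j and the per-engine confusion matrices π_{jl}^{(k)} from the engine votes (a Dawid-Skene-style estimator); the initialization from normalized vote counts, the convergence tolerance and the toy data are assumptions of this example.

```python
import numpy as np

def em_family_inference(votes, n_iter=100, tol=1e-6):
    """votes[i, k] = index of the family that antivirus engine k assigns to sample i.
    Returns the estimated true family index for each sample."""
    I, K = votes.shape
    J = int(votes.max()) + 1
    # x[i, k, l] = 1 if engine k labels sample i as family l (one-hot encoding).
    x = np.zeros((I, K, J))
    x[np.arange(I)[:, None], np.arange(K)[None, :], votes] = 1.0

    # Initial estimate of T[i, j] from the normalized engine votes.
    T = x.sum(axis=1)
    T /= T.sum(axis=1, keepdims=True)

    prev_ll = -np.inf
    for _ in range(n_iter):
        # M-step: family priors theta_j and per-engine confusion matrices pi[k, j, l].
        theta = T.mean(axis=0)                              # shape (J,)
        pi = np.einsum("ij,ikl->kjl", T, x)                 # shape (K, J, J)
        pi /= pi.sum(axis=2, keepdims=True) + 1e-12
        # E-step: posterior over the true family of each sample, via log-sum-exp.
        log_p = np.log(theta + 1e-12) + np.einsum("ikl,kjl->ij", x, np.log(pi + 1e-12))
        m = log_p.max(axis=1, keepdims=True)
        ll = (m[:, 0] + np.log(np.exp(log_p - m).sum(axis=1))).sum()
        T = np.exp(log_p - m)
        T /= T.sum(axis=1, keepdims=True)
        if abs(ll - prev_ll) < tol:                         # stop once the likelihood converges
            break
        prev_ll = ll
    return T.argmax(axis=1)

# Toy example: 4 samples, 3 engines, 2 families; the third engine is noisy.
votes = np.array([[0, 0, 1],
                  [0, 0, 0],
                  [1, 1, 0],
                  [1, 1, 1]])
print(em_family_inference(votes))   # expected to recover something like [0 0 1 1]
```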
FIG. 6 is a schematic diagram of the functional modules of the system of the present invention: the invention also provides a system for implementing the malware family inference method described above, comprising a data acquisition module, a database module, a scanning module, a maliciousness inference module and a family inference module connected in series in that order; the data acquisition module is used to obtain executable programs and construct the sample set; the database module is used to query each executable program in the sample set and record the query results; the scanning module is used to scan the executable programs in the sample set with multiple antivirus engines and preprocess the scan results; the maliciousness inference module is used to infer the maliciousness of each sample with heuristic rules based on the obtained preprocessing result and output the inference result; and the family inference module is used to perform maximum-likelihood estimation modeling and solving based on the obtained preprocessing result to output the final family inference result.

Claims (5)

1. A malware family inference method comprising the steps of:
S1, obtaining executable programs and constructing a sample set;
S2, querying each executable program in the sample set and recording the results;
S3, scanning the executable programs in the sample set with multiple antivirus engines and preprocessing the scan results; specifically, the executable programs in the sample set are scanned with existing antivirus engines and the scan results are processed as follows: identical-result filtering, result-label tokenization, generic-string and random-string filtering, malware family alias replacement, maliciousness calculation, and family maximum engine agreement calculation;
wherein the malware family alias replacement uses a malware family alias database to substitute aliases so that family names are unified; the alias database is learned with the following strategy: let A and B be two family names and A → B the rule mapping family name A to B; define the support support(A → B) = P(AB) as the probability that A and B occur together, and the confidence confidence(A → B) = P(B|A) = P(AB)/P(A) as the probability that B occurs when A occurs; if support(A → B) is above a set threshold, confidence(A → B) is above a set threshold, and A → B is manually judged to be a genuine alias relation, B is taken to be an alias of A; otherwise B is treated as a different name from A;
the maliciousness calculation computes the maliciousness E_1 as
E_1 = n / K
where n is the number of antivirus engines that identify the sample as malicious after result filtering and K is the total number of antivirus engines after result filtering;
the family maximum engine agreement calculation computes the family maximum engine agreement E_2 as
E_2 = max_j ( Σ_i f_{ij} )
where the sum runs over the antivirus engines i after result filtering, and f_{ij} indicates whether antivirus engine i gives family tag j: f_{ij} = 1 if engine i gives family tag j, and f_{ij} = 0 otherwise;
S4, inferring the maliciousness of the sample with heuristic rules based on the preprocessing result obtained in step S3, and outputting the inference result; this specifically includes the following steps, each of which outputs a Boolean result:
A. judge whether the difference between the scanning time and the first-seen time of the sample is greater than a set threshold, and output the judgment result r_a;
B. judge whether the sample is packed, and output the judgment result r_b;
C. judge whether the maliciousness of the sample is greater than a set threshold, and output the judgment result r_c;
D. judge whether the family maximum engine agreement of the sample is greater than a set threshold, and output the judgment result r_d;
E. from the judgment results of steps A to D, compute the maliciousness result r of the sample as
r = r_a & r_b & r_c & r_d
where & is the logical AND operation;
and S5, based on the preprocessing result obtained in the step S3, performing maximum likelihood estimation modeling and solving, and outputting and storing a family inference result.
2. The malware family inference method of claim 1, wherein in step S2 querying and recording the executable programs in the sample set specifically means querying whether an inference result already exists for the current program: if so, returning the inference result directly; if not, recording the first-seen time and the scanning time of the program.
3. The malware family inference method of claim 2, wherein in step S5, performing maximum-likelihood estimation modeling and solving based on the preprocessing result obtained in step S3 to output the final family inference result specifically comprises:
supposing there are K antivirus engines, I samples and J families, that the true family of sample i is Y_i, and that Y_i follows a categorical distribution with probabilities θ = [θ_1, θ_2, ..., θ_J]; the corresponding likelihood function is defined as
h(θ, π) = ∏_{i=1}^{I} ∏_{j=1}^{J} [ θ_j ∏_{k=1}^{K} ∏_{l=1}^{J} (π_{jl}^{(k)})^{x_{il}^{(k)}} ]^{T_{ij}}
where T_{ij} is an indicator variable: if q is the true family of sample i then T_{iq} = 1 and T_{ij} = 0 for j ≠ q; θ_j is the probability that sample i comes from family j; π_{jl}^{(k)} is the probability that antivirus engine k identifies a family-j sample as family l; x_{il}^{(k)} indicates whether engine k labels sample i as family l; and both θ_j and π_{jl}^{(k)} are unknown; an expectation-maximization algorithm is used to find the θ_j and π_{jl}^{(k)} that maximize the function h, from which the true family label Ŷ_i of the sample is estimated.
4. The malware family inference method of claim 3, wherein the expectation-maximization algorithm is an iterative optimization process in which θ_j and π_{jl}^{(k)} are both estimated from the current estimate of the indicator T_{ij}, and h, θ_j and π_{jl}^{(k)} are recomputed in a loop until the function h converges, at which point the algorithm terminates and the true family labels are output.
5. A system for implementing the malware family inference method of any one of claims 1-4, comprising a data acquisition module, a database module, a scanning module, a maliciousness inference module and a family inference module connected in series in that order; the data acquisition module is used to obtain executable programs and construct a sample set; the database module is used to query each executable program in the sample set and record the results; the scanning module is used to scan the executable programs in the sample set with multiple antivirus engines and preprocess the scan results; the maliciousness inference module is used to infer the maliciousness of each sample with heuristic rules based on the obtained preprocessing result and output the inference result; and the family inference module is used to perform maximum-likelihood estimation modeling and solving based on the obtained preprocessing result to output the final family inference result.
CN202110820216.2A 2021-07-20 2021-07-20 Malicious software family inference method and system Active CN113468532B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110820216.2A CN113468532B (en) 2021-07-20 2021-07-20 Malicious software family inference method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110820216.2A CN113468532B (en) 2021-07-20 2021-07-20 Malicious software family inference method and system

Publications (2)

Publication Number Publication Date
CN113468532A CN113468532A (en) 2021-10-01
CN113468532B (en) 2022-09-23

Family

ID=77881296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110820216.2A Active CN113468532B (en) 2021-07-20 2021-07-20 Malicious software family inference method and system

Country Status (1)

Country Link
CN (1) CN113468532B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280350A (en) * 2018-02-05 2018-07-13 南京航空航天大学 A kind of mobile network's terminal Malware multiple features detection method towards Android
CN111552971A (en) * 2020-04-30 2020-08-18 四川大学 Malicious software family classification evasion method based on deep reinforcement learning
RU2738344C1 (en) * 2020-03-10 2020-12-11 Общество с ограниченной ответственностью «Группа АйБи ТДС» Method and system for searching for similar malware based on results of their dynamic analysis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9853997B2 (en) * 2014-04-14 2017-12-26 Drexel University Multi-channel change-point malware detection

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108280350A (en) * 2018-02-05 2018-07-13 南京航空航天大学 A kind of mobile network's terminal Malware multiple features detection method towards Android
RU2738344C1 (en) * 2020-03-10 2020-12-11 Общество с ограниченной ответственностью «Группа АйБи ТДС» Method and system for searching for similar malware based on results of their dynamic analysis
CN111552971A (en) * 2020-04-30 2020-08-18 四川大学 Malicious software family classification evasion method based on deep reinforcement learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Research and Progress on Android Malware Detection; Peng Guojun; Journal of Wuhan University; 2015-02-28; full text *
EM Meets Malicious Data: A Novel Method for Massive Malware Family Inference; Yongkang Jiang; ICBDT 2020: Proceedings of the 2020 3rd International Conference on Big Data Technologies; 2020-10-23; full text *
A Multi-Label Detection Method for Android Malware; Wang Jun et al.; Journal of Chinese Computer Systems; 2017-10-15 (No. 10); full text *
Permission-Based Classification and Detection of Android Malware; Zheng Yanmei et al.; Modern Computer (Professional Edition); 2018-02-25 (No. 06); full text *
Homology Analysis of Malicious Code Based on Static Structure; Chen Qi et al.; Computer Engineering and Applications (No. 14); full text *

Also Published As

Publication number Publication date
CN113468532A (en) 2021-10-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant