CN113761912A

CN113761912A - Interpretable judging method and device for malicious software attribution attack organization

Info

Publication number: CN113761912A
Application number: CN202110909793.9A
Authority: CN
Inventors: 严寒冰; 王琴琴; 周彧; 梅瑞; 张永铮
Original assignee: Institute of Information Engineering of CAS; National Computer Network and Information Security Management Center
Current assignee: Institute of Information Engineering of CAS; National Computer Network and Information Security Management Center
Priority date: 2021-08-09
Filing date: 2021-08-09
Publication date: 2021-12-07
Anticipated expiration: 2041-08-09
Also published as: CN113761912B

Abstract

The invention discloses an interpretable judging method and device for malicious software attribution attack organization, which analyzes the attack organization attribution of malicious software by extracting code characteristics and character string characteristics of the malicious software, and integrates static characteristics and dynamic characteristics of the malicious software, so that the characteristics of the invention are more comprehensive, the characteristics are vectorized by using a natural language processing technology, and meanwhile, the invention uses a model interpretation technology to interpret the result of a classifier, so that the classification result is more convincing, thereby effectively solving the problem that the attack organization attribution of the malicious software cannot be comprehensively analyzed in the prior art.

Description

Interpretable judging method and device for malicious software attribution attack organization

Technical Field

The invention relates to the technical field of computers, in particular to an interpretable judgment method and an interpretable judgment device for malicious software attribution attack organization.

Background

Attack organizations often build malware to carry out cyber attacks. Advanced persistent threat (APT) attacks, also known as targeted threat attacks, are one type of cyber attack. An APT attack is a covert and persistent computer intrusion process, usually carefully planned by its operators against a particular target. It is usually motivated by commercial or political interests, directed at a specific organization or country, and requires a high degree of concealment to be maintained over a long period of time. Cyber attacks pose a serious threat to cyberspace security. It is therefore necessary to analyze network attacks and to study attack organizations, and this work relies on the analysis of malware. However, existing approaches select only a single type of feature when attributing malware to an attack organization, so the attribution features are not comprehensive enough.

Disclosure of Invention

The invention provides an interpretable judgment method and an interpretable judgment device for attack organizations to which malicious software belongs, and aims to solve the problem that the attack organizations to which the malicious software belongs cannot be comprehensively analyzed in the prior art.

In a first aspect, the present invention provides an interpretable decision method for a malware attribution attack organization, the method comprising: extracting code features of the malicious software, preprocessing the code features, and vectorizing the preprocessed code features, wherein the code features are the malicious software features taking a function as a unit; extracting character string features of malicious software, preprocessing the character string features, and vectorizing the preprocessed character string features; and performing attack organization attribution of the malicious software based on the vectorized code features and the character string features, and respectively interpreting the classification results of the code features and the character string features.

Optionally, the extracting code features of the malware includes: extracting metadata of the malicious software, and converting the metadata into IR intermediate representation; the metadata comprises a hash of the malware, a compiler, a function name of each function, a Control Flow Graph (CFG), basic blocks and byte codes.

Optionally, the preprocessing the code feature includes: and generalizing the low-frequency words after the IR intermediate representation conversion, and converting all functions of the malicious software into sequential texts according to the program call graph.

Optionally, the vectorizing the pre-processed code features includes: a function vector is generated for the text of each function using the PV-DM algorithm.

Optionally, the extracting the character string features of the malware includes: and extracting a behavior report of the malicious software through a hash value of the malicious software.

Optionally, preprocessing the character string features includes: and segmenting the text in the behavior report.

Optionally, the attributing of attack organization of malware based on vectorized code features and character string features includes: obtaining the classification probability of each vectorized code feature through a random forest classifier; obtaining the classification probability of the character string characteristics after vectorization through a DNN classifier; and integrating the classification probabilities of the multiple code characteristics and the classification probabilities of the character string characteristics to obtain a final classification result of the malicious software.

Optionally, the separately interpreting the code feature and the character string feature classification result includes: and interpreting the code feature classification result through a random forest, and interpreting the character string feature classification result through LIME.

In a second aspect, the present invention provides an interpretable decision-making apparatus for a malware attribution attack organization, the apparatus comprising: the first processing unit is used for extracting code features of malicious software, preprocessing the code features and vectorizing the preprocessed code features, wherein the code features are the malicious software features taking a function as a unit; the second processing unit is used for extracting character string features of the malicious software, preprocessing the character string features and vectorizing the preprocessed character string features; and the third processing unit is used for performing attack organization attribution of the malicious software based on the vectorized code features and the character string features and respectively interpreting the code features and the character string feature classification results.

In a third aspect, the present invention provides a computer-readable storage medium, in which a signal-mapped computer program is stored, and the computer program, when executed by at least one processor, implements any one of the above-mentioned interpretable determination methods for a malware attribution attack organization.

The invention has the following beneficial effects:

the attack organization attribution of the malicious software is analyzed by extracting the code characteristics and the character string characteristics of the malicious software, and the two characteristics synthesize the static characteristics and the dynamic characteristics of the malicious software, so that the characteristics of the attack organization attribution analysis method are more comprehensive, the characteristics are vectorized by using a natural language processing technology, and meanwhile, the results of a classifier are explained by using a model interpretation technology, so that the classification results are more convincing, and the problem that the attack organization attribution of the malicious software cannot be comprehensively analyzed in the prior art is effectively solved.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

Fig. 1 is a flowchart illustrating an interpretable determination method for a malware attribution attack organization according to a first embodiment of the present invention;

Fig. 2 is a schematic structural diagram of an interpretable determination apparatus for a malware attribution attack organization according to a second embodiment of the present invention.

Detailed Description

Aiming at the problem that the attack organization affiliation of the malicious software cannot be comprehensively analyzed in the prior art, the attack organization affiliation of the malicious software is analyzed by extracting the code characteristics and the character string characteristics of the malicious software, and particularly, because the two characteristics integrate the static characteristics and the dynamic characteristics of the malicious software, the characteristics are more comprehensive, and the characteristics are vectorized by using a natural language processing technology. The present invention will be described in further detail below with reference to the drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.

A first embodiment of the present invention provides an interpretable determination method for malicious software attack organization, and referring to fig. 1, the method includes:

S101, extracting code features of malicious software, preprocessing the code features, and vectorizing the preprocessed code features, wherein the code features are the malicious software features taking a function as a unit;

In a specific implementation, metadata is extracted from the malware and converted into an intermediate representation (IR); the low-frequency words in the converted IR are then generalized, all functions of the malware are converted into sequential texts according to the program call graph, and finally a function vector is generated for the text of each function by using the PV-DM algorithm.

The metadata in the embodiment of the present invention includes a hash of malware, a compiler, a function name of each function, a Control Flow Graph (CFG), basic blocks, and byte codes.

Specifically, in the embodiment of the present invention, code feature extraction and preprocessing are as follows. Code features refer to malware features in units of functions. To obtain the function information of the malware, IDAPython with IDA Pro, or another code analysis tool, is used to extract metadata, which comprises the hash and compiler of the malware and, for each function, the function name, Control Flow Graph (CFG), basic blocks and byte codes. Library functions are excluded. To reduce the differences introduced by different platforms and compiler options, the present invention uses VEX (or another intermediate representation conversion tool) for Intermediate Representation (IR) conversion. The byte codes are thus converted into VEX IR, with the compiler information used as conversion parameters for the IR.

The vectorization of the code features in the embodiment of the invention is as follows. The present invention uses the PV-DM paragraph vector algorithm to vectorize the functions, generating one vector for each function. The input of the paragraph vector algorithm is a document, whereas the structure of a function is a CFG graph structure, so the invention generates a document for each function. The algorithm takes this document as input when generating the function vector, and one VEX IR statement is treated as one word. To reduce the impact of low-frequency words in the documents on the results, they are generalized. Specifically, constants are replaced by <num>, temporary variables by <tmp>, character strings by <str>, function names by <func>, registers by <reg>, and other low-frequency words by <other>.
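The generalization step described above can be sketched roughly as follows. This is a minimal illustration only, not the implementation of the invention: the regular expressions, the `REGISTERS` tuple, the `min_count` threshold, the helper names and the VEX-like statement syntax in the example are all assumptions introduced here.

```python
import re
from collections import Counter

# Hypothetical register names; a real tool would derive them from the target architecture.
REGISTERS = ("rax", "rbx", "rcx", "rdx", "rsp", "rbp", "rip")


def generalize_statement(stmt, func_names=()):
    """Replace volatile operands in one IR statement with placeholder tokens."""
    stmt = re.sub(r'"[^"]*"', "<str>", stmt)                      # string literals
    for name in func_names:                                       # known function names
        stmt = re.sub(r"\b%s\b" % re.escape(name), "<func>", stmt)
    stmt = re.sub(r"\bt\d+\b", "<tmp>", stmt)                     # temporaries such as t12
    stmt = re.sub(r"\b(?:%s)\b" % "|".join(REGISTERS), "<reg>", stmt)
    stmt = re.sub(r"\b0x[0-9a-fA-F]+\b|\b\d+\b", "<num>", stmt)   # hex and decimal constants
    return stmt


def generalize_corpus(functions, min_count=2, func_names=()):
    """Map {function name: [IR statements]} to {function name: [words]}.

    Each generalized statement is one word; words seen fewer than
    min_count times across the whole corpus become <other>.
    """
    docs = {name: [generalize_statement(s, func_names) for s in stmts]
            for name, stmts in functions.items()}
    counts = Counter(word for words in docs.values() for word in words)
    return {name: [w if counts[w] >= min_count else "<other>" for w in words]
            for name, words in docs.items()}


# Illustrative statements only; real VEX IR output may be formatted differently.
functions = {
    "f1": ["t1 = GET:I64(rsp)", "t2 = Add64(t1,0x8)", "PUT(rip) = 0x400586"],
    "f2": ["t5 = GET:I64(rbp)", "t7 = Add64(t5,0x10)", "t9 = Sub64(t7,0x1)"],
}
documents = generalize_corpus(functions, min_count=2)
```

Each resulting word list is one document that could then be passed to a PV-DM implementation (for example, gensim's `Doc2Vec` with `dm=1`) to obtain one vector per function.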

S102, extracting character string features of malicious software, preprocessing the character string features, and vectorizing the preprocessed character string features;

Specifically, the embodiment of the invention extracts the behavior report of the malicious software through the hash value of the malicious software, performs word segmentation on the text in the behavior report, and vectorizes the segmented text.

S103, attack organization attribution of malicious software is carried out on the basis of the vectorized code features and the character string features, and classification results of the code features and the character string features are respectively explained.

That is, in the embodiment of the present invention, the random forest classifier is used to obtain the classification probability of each vectorized code feature, the DNN classifier is used to obtain the classification probability of each vectorized character string feature, and finally the classification probabilities of a plurality of code features and the classification probabilities of the character string features are integrated to obtain the final classification result of the malware.

The respectively explaining the code characteristics and the character string characteristic classification results in the embodiment of the invention is to explain the code characteristic classification results through a random forest and explain the character string characteristic classification results through LIME.

In specific implementation, the character string feature extraction and preprocessing according to the embodiment of the present invention includes: string features refer to dynamic behavior reports of malware. The dynamic behavior report for malware may be downloaded from VirusTotal, or obtained from a Cuckoo sandbox, based on the hash value. The dynamic behavior report is a JSON file.

The character string features in the embodiment of the invention are vectorized as follows. Character string feature vectorization uses a common method in natural language processing, namely the one-hot vector. The content of the report file is first segmented into words: the NLTK method is used for word segmentation and, because the content contains a large number of special characters, the special characters are also used as delimiters for word segmentation. The one-hot method is then used to generate vectors for the report features.
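As a rough illustration of this step, the sketch below splits report text on special characters and builds a binary presence vector over a vocabulary. It rests on several assumptions: a plain regular expression stands in for the NLTK tokenizer, "one-hot" is read here as one 0/1 entry per vocabulary word for the whole report, and the report strings and helper names are invented for the example.

```python
import re


def tokenize(text):
    """Split on every run of non-alphanumeric characters and lowercase the tokens."""
    return [tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]


def build_vocabulary(token_lists):
    """Assign each distinct token an index, in sorted order for reproducibility."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}


def report_vector(tokens, vocab):
    """Binary vector: entry i is 1 if vocabulary word i occurs in the report."""
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec


# Made-up fragments standing in for text taken from a JSON behavior report.
reports = [r"CreateFile C:\Temp\a.dll", r"RegSetValue HKLM\Run a.dll"]
token_lists = [tokenize(r) for r in reports]
vocab = build_vocabulary(token_lists)
vectors = [report_vector(tokens, vocab) for tokens in token_lists]
```

With a real corpus the vocabulary would be built from the training reports only, so that tokens unseen at training time are simply ignored when a new report is vectorized.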

In the embodiment of the present invention, the attack organization attribution is implemented using classifiers. The code features use a random forest classifier with the function vectors as input; because the malicious software has a plurality of functions, a classification probability is obtained for each function through the classifier. The character string features use a DNN classifier with the report vector as input to obtain the report classification probability. The classification probabilities of the plurality of functions and the report classification probability are then integrated to obtain the final classification result of the malicious software. The specific steps are as follows:

Assume that a binary file has $n$ functions, $f_1, f_2, \dots, f_n$, each with a corresponding function vector, and that the report vector is $v_r$. The output of a classifier is $P = \langle p_1, p_2, \dots, p_m \rangle$, where $m$ is the number of attack organizations and $p_1$ represents the probability that the classifier predicts the input to be the first attack organization. According to the method of the invention, the function vectors and the report vector are used as input, and a prediction result is obtained for each of them from the corresponding classifier.

Combining these prediction results, the prediction probability of the malicious software is $P = \langle P_1, P_2, \dots, P_m \rangle$, where $P_1$ is computed from the $p_1$ entries of the individual prediction results [combining formula missing from the source text]. The attack organization corresponding to the maximum probability value in $P$ is the final attack organization attribution result of the malicious software.
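The integration step can be sketched as below. The exact combining rule is not recoverable from the text, so this sketch assumes a plain arithmetic mean over all per-function probability vectors and the report probability vector; that averaging rule, the function name `integrate` and the sample numbers are assumptions for illustration, not the rule defined by the invention.

```python
def integrate(function_probs, report_probs):
    """Combine per-function and report probability vectors.

    ASSUMPTION: the vectors are combined by an unweighted mean; the
    source does not state the actual combining formula.
    Returns (combined vector, index of the most probable organization).
    """
    vectors = [list(p) for p in function_probs] + [list(report_probs)]
    m = len(report_probs)
    if any(len(v) != m for v in vectors):
        raise ValueError("all probability vectors must have length m")
    combined = [sum(v[j] for v in vectors) / len(vectors) for j in range(m)]
    best = max(range(m), key=combined.__getitem__)
    return combined, best


# Made-up probabilities for n = 2 functions and m = 3 attack organizations.
function_probs = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
report_probs = [0.6, 0.3, 0.1]
combined, best = integrate(function_probs, report_probs)
```

A weighted scheme, for instance one that gives the report vector the same total weight as all function vectors together, would fit the same interface; only the line computing `combined` would change.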

It should be noted that the model interpretation in the embodiment of the present invention is to interpret the classification result, that is, to identify which features lead the classification model to make a given classification decision. The random forest classifier is naturally interpretable in the method of the invention. For the DNN classifier, LIME is used for model interpretation.

According to the attack organization attribution result of the malicious software, corresponding attack organization prediction probabilities in the function prediction result are sorted from large to small. The first few functions are the interpretation results of the code feature attribution model, and the functions are key functions with important attention. Meanwhile, in order to find out which functions in the attack organization are similar to the key functions, the similarity between the functions is calculated by using cosine distances of function vectors, and the smaller the distance is, the more similar the functions are.
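The ranking and similarity computation described above might look like the following sketch. The function names, the parameter `k` and the sample vectors are hypothetical; cosine distance is taken here as one minus cosine similarity, which is a common convention but is not spelled out in the text.

```python
import math


def top_functions(function_probs, org_index, k=3):
    """Rank functions by predicted probability for the attributed organization."""
    ranked = sorted(function_probs.items(),
                    key=lambda item: item[1][org_index], reverse=True)
    return [name for name, _ in ranked[:k]]


def cosine_distance(u, v):
    """1 - cosine similarity; a smaller value means more similar vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def most_similar(key_vector, known_vectors, k=1):
    """Return the k known functions closest to the key function vector."""
    ranked = sorted(known_vectors.items(),
                    key=lambda item: cosine_distance(key_vector, item[1]))
    return [name for name, _ in ranked[:k]]


# Made-up per-function probabilities over two attack organizations.
probs = {"f1": [0.2, 0.8], "f2": [0.9, 0.1], "f3": [0.6, 0.4]}
key_functions = top_functions(probs, org_index=0, k=2)

# Made-up function vectors of samples already attributed to that organization.
known = {"g1": [1.0, 0.0], "g2": [0.0, 1.0], "g3": [0.7, 0.7]}
nearest = most_similar([0.9, 0.1], known, k=1)
```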

The character string feature attribution model is interpreted using the LIME model. The result of LIME is a feature rank and corresponding contribution value. The contribution value represents the contribution of the feature to the classification.

Generally speaking, the interpretable attack organization attribution method for the malicious software, which is provided by the invention, can be used for performing attack organization attribution on suspicious malicious software and obtaining important characteristics of the attack organization attribution for network security technicians. This provides an important basis for the analysis of attack organization attacks and threat intelligence.

The method according to an embodiment of the invention will be explained and illustrated in detail below by means of a specific example:

the embodiment of the invention provides a method for explaining attack organization affiliation of malicious software, which comprises the following steps:

extracting code characteristics of the malicious software, namely extracting metadata of the malicious software, and then converting the metadata into Intermediate Representation (IR);

preprocessing the code characteristics, namely generalizing low-frequency words in the IR, and converting all functions of malicious software into sequential texts according to a program call graph;

vectorizing code characteristics, namely generating a function vector for the text of each function by using the PV-DM (Distributed Memory Model of Paragraph Vectors) algorithm;

extracting character string characteristics of the malicious software, namely acquiring a behavior report from VirusTotal by using a hash value of the malicious software;

the character string features are preprocessed by segmenting the text in the behavior report. The word segmentation method uses an NLTK method and special character word segmentation;

vectorization of character string features, namely generating report vectors for behavior reports by using a one-hot encoding method;

attack organization attribution of malicious software, namely: for the code features, the function vectors are used as input and a random forest classifier performs attack organization classification; for the character string features, the report vector is used as input and a DNN (deep neural network) classifier performs attack organization classification; the classification results of the code features and the character string features are then integrated to obtain the final attack organization attribution result of the malicious software;

and model interpretation, namely interpreting the code feature classification results by using the random forest and the character string feature classification results by using LIME (Local Interpretable Model-agnostic Explanations).

Generally speaking, the embodiment of the invention extracts the code features and character string features of the malicious software so as to combine the static features and dynamic features of the malicious software, which makes the features more comprehensive. The features are vectorized by using natural language processing technology: PV-DM vectorizes the functions and one-hot encoding vectorizes the behavior reports, which allows the semantics of the malicious software to be expressed. The results of the classifiers are interpreted by using model interpretation technology, so that the classification results are more convincing.

A second embodiment of the present invention provides an interpretable determination apparatus for malicious software belonging attack organization, and referring to fig. 2, the apparatus includes: the first processing unit is used for extracting code features of malicious software, preprocessing the code features and vectorizing the preprocessed code features, wherein the code features are the malicious software features taking a function as a unit; the second processing unit is used for extracting character string features of the malicious software, preprocessing the character string features and vectorizing the preprocessed character string features; and the third processing unit is used for performing attack organization attribution of the malicious software based on the vectorized code features and the character string features and respectively interpreting the code features and the character string feature classification results.

The device provided by the embodiment of the invention can simultaneously extract the code characteristics and the character string characteristics of the malicious software, thereby realizing the analysis of the attack organization attribution of the malicious software by integrating the static characteristics and the dynamic characteristics of the malicious software, and finally realizing the accurate analysis of the attack organization attribution of the malicious software.

The relevant content of the embodiments of the present invention can be understood by referring to the first embodiment of the present invention, and will not be discussed in detail herein.

A third embodiment of the present invention provides a computer-readable storage medium storing a signal-mapped computer program, which when executed by at least one processor, implements the method for interpretable determination of malware homing attack organization of any one of the first embodiments of the present invention.

Although the preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will appreciate that various modifications, additions and substitutions are possible, and the scope of the invention should not be limited to the embodiments described above.

Claims

1. An interpretable decision method for a malware home attack organization, comprising:

extracting code features of the malicious software, preprocessing the code features, and vectorizing the preprocessed code features, wherein the code features are the malicious software features taking a function as a unit;

extracting character string features of malicious software, preprocessing the character string features, and vectorizing the preprocessed character string features;

and performing attack organization attribution of the malicious software based on the vectorized code features and the character string features, and respectively interpreting the classification results of the code features and the character string features.

2. The method of claim 1, wherein extracting code features of malware comprises:

extracting metadata of the malicious software, and converting the metadata into IR intermediate representation;

the metadata comprises a hash of the malware, a compiler, a function name of each function, a control flow graph CFG, basic blocks and byte codes.

3. The method of claim 2, wherein preprocessing the code features comprises:

and generalizing the low-frequency words after the IR intermediate representation conversion, and converting all functions of the malicious software into sequential texts according to the program call graph.

4. The method of claim 1, wherein vectorizing the pre-processed code features comprises:

a function vector is generated for the text of each function using the PV-DM algorithm.

5. The method according to any one of claims 1-4, wherein the extracting character string features of the malware comprises:

and extracting a behavior report of the malicious software through a hash value of the malicious software.

6. The method of claim 5, wherein preprocessing the string features comprises:

and segmenting the text in the behavior report.

7. The method according to any one of claims 1-4, wherein the vectorized code feature and character string feature based attack organization attribution of malware comprises:

obtaining the classification probability of each vectorized code feature through a random forest classifier;

obtaining the classification probability of the character string characteristics after vectorization through a DNN classifier;

and integrating the classification probabilities of the multiple code characteristics and the classification probabilities of the character string characteristics to obtain a final classification result of the malicious software.

8. The method of claim 7, wherein interpreting the code feature and the string feature classification results separately comprises:

and interpreting the code feature classification result through a random forest, and interpreting the character string feature classification result through LIME.

9. An interpretable decision apparatus for a malware home attack organization, comprising:

the first processing unit is used for extracting code features of malicious software, preprocessing the code features and vectorizing the preprocessed code features, wherein the code features are the malicious software features taking a function as a unit;

the second processing unit is used for extracting character string features of the malicious software, preprocessing the character string features and vectorizing the preprocessed character string features;

and the third processing unit is used for performing attack organization attribution of the malicious software based on the vectorized code features and the character string features and respectively interpreting the code features and the character string feature classification results.

10. A computer-readable storage medium, characterized in that it stores a signal-mapped computer program which, when executed by at least one processor, implements the interpretable decision method for a malware home attack organization of any one of claims 1-8.