CN116186702B - Malicious software classification method and device based on cooperative attention

Info

Publication number: CN116186702B
Application number: CN202310160409.9A
Authority: CN
Inventors: 刘峰; 鲍怀锋; 王文; 汤子贤
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2023-02-24
Filing date: 2023-02-24
Publication date: 2024-02-13
Anticipated expiration: 2043-02-24
Also published as: CN116186702A

Abstract

The invention discloses a malware classification method and device based on cooperative attention. The method comprises the following steps: acquiring an assembly instruction sequence and an API call sequence of the malware; calculating a feature representation sequence $S^{op}$ of the assembly instruction sequence and a feature representation sequence $S^{api}$ of the API call sequence; inputting $S^{op}$ and $S^{api}$ respectively into a Transformer neural network model to obtain a static feature representation $v^{op}$ and a dynamic feature representation $v^{api}$ of the malware, together with a formalized representation $H^{op}$ of the assembly instruction sequence and a formalized representation $H^{api}$ of the API call sequence; inputting $H^{op}$ and $H^{api}$ into a neural network model based on cooperative attention to obtain a static collaborative feature representation and a dynamic collaborative feature representation of the malware; and concatenating $v^{op}$, $v^{api}$, the static collaborative feature representation and the dynamic collaborative feature representation and then classifying, to obtain a classification result of the malware. The invention enables the classification of malware.

Description

Malicious software classification method and device based on cooperative attention

Technical Field

The invention belongs to the field of network threat protection, relates to a malicious software classification technology, and particularly relates to a malicious software classification method and device based on cooperative attention.

Background

Malware classification is one of the main branches of cyber threat protection. It analyzes the code or behavioral feature distribution of captured malware, using signatures or artificial-intelligence-based recognition algorithms, in order to identify newly emerging malware effectively. Malware refers to software that, without the user's permission, gathers sensitive information, controls user devices, and seriously infringes on the user's personal interests. As computers have become an integral part of production and daily life, the impact of malware has spread from virtual cyberspace to the physical world. Malware production shows a trend toward assembly-line production and refinement. The ever-growing volume of malware poses a significant security threat to users and a major challenge to security practitioners.

Traditional signature-based detection can no longer meet current malware detection requirements, and detecting new malware variants has become a research focus in recent years. Because of profit considerations, programming habits, and similar factors, malware exhibits code reuse and family continuity. If newly emerging malware can be attributed to the correct family, the efficiency of analysts can be greatly improved and strong evidence can be provided for attribution. Existing machine-learning-based malware family classification methods are divided into static analysis and dynamic analysis, according to whether the target code is executed during detection. Static analysis analyzes a program without running it; for malware, the objects of static analysis are typically n-gram sequences of binary opcodes, PE header information, strings, grayscale images, and the like. Dynamic analysis can be performed only while the program runs; it monitors the runtime behavior of the code or its effect on the system to decide whether the code is malicious, and its classification features are typically API sequences, system logs, memory changes, and the like. However, existing methods share two general problems:

- Low classification accuracy. The features selected by existing methods have limited expressive power, for example using only header information or string information, or considering only API sequences without API function parameters. Their discriminative power is insufficient when the classification categories are fine-grained or few training samples are available, which easily leads to misclassification.

- Poor interpretability. Because of the black-box nature of machine learning, most methods can only output a classification result. For example, many grayscale-image-based CNN methods, which classify malware families using grayscale texture features, are poorly interpretable, cannot provide evidence for practical deployment, and offer little help to analysts. A few methods, such as decision trees, can report feature importance, but abstract features still cannot give intuitive feedback.

Disclosure of Invention

The invention aims to provide a method and a device for classifying malware based on cooperative attention. Using a designed multi-source fusion feature encoding algorithm, a sequence feature representation algorithm, a malware feature representation and classification algorithm, and a classification decision importance visualization algorithm, the invention numerically encodes each assembly instruction or API call, hierarchically computes a feature representation of each software instance that is used to determine its maliciousness, and records and visualizes the attention values produced during the feature representation computation, thereby realizing interpretable malware classification.

First, malware instances of different malware families are collected; an assembly instruction sequence is obtained with disassembly software, and an API call sequence is obtained through simulated execution in a sandbox. Each assembly instruction or API call is then numerically encoded with the designed multi-source fusion feature encoding algorithm. Next, a feature representation of each malware instance is constructed with the designed malware feature representation and classification algorithm, and its family distribution probability is calculated. Finally, the classification influence factors of the dynamic and static features of the malware instance are visualized with the designed malicious feature visualization analysis algorithm, so as to explain the classification result.

The technical scheme adopted by the invention is as follows:

a method of collaborative attention-based malware classification, the method comprising:

acquiring an assembly instruction sequence and an API call sequence of malicious software;

calculating a feature representation sequence $S^{op}$ of the assembly instruction sequence and a feature representation sequence $S^{api}$ of the API call sequence;

prepending the encoding of a [CLS] field to the feature representation sequence $S^{op}$ and to the feature representation sequence $S^{api}$, feeding each into a Transformer neural network model V based on a self-attention mechanism, taking the output of the Transformer neural network model V at the position corresponding to [CLS] as the static feature representation $v^{op}$ and the dynamic feature representation $v^{api}$ of the malware, and obtaining, based on the hidden-layer feature sequences of the Transformer neural network model V, a formalized representation $H^{op}$ of the assembly instruction sequence and a formalized representation $H^{api}$ of the API call sequence;

inputting the formalized representation $H^{op}$ and the formalized representation $H^{api}$ into a neural network model based on cooperative attention to obtain a static collaborative feature representation and a dynamic collaborative feature representation of the malware;

concatenating the static feature representation $v^{op}$, the dynamic feature representation $v^{api}$, the static collaborative feature representation and the dynamic collaborative feature representation, and then classifying, to obtain a classification result of the malware.

Further, the acquiring the assembly instruction sequence and the API call sequence of the malicious software comprises the following steps:

extracting assembly codes of each function in the malicious software by using disassembly software to obtain an assembly instruction sequence;

and running the malware under simulated execution in a sandbox to obtain an API call sequence.

Further, the calculating a feature representation sequence $S^{op}$ of the assembly instruction sequence and a feature representation sequence $S^{api}$ of the API call sequence comprises:

numerically encoding the assembly instruction sequence and the API call sequence respectively, and then adding position encoding, to obtain a digitized encoding sequence of the assembly instructions and a digitized encoding sequence of the API calls;

prepending the encoding of a [CLS] field to each digitized encoding sequence, and feeding each into a Transformer neural network model V' based on a self-attention mechanism;

taking the output of the Transformer neural network model V' at the position corresponding to [CLS] as the feature representation sequence $S^{op}$ or the feature representation sequence $S^{api}$ of the malware.

Further, the numerically encoding the assembler instruction sequence includes:

using word2vec to encode an operation code in the assembly instruction sequence to obtain a first numeric embedded sequence of the assembly instruction sequence;

encoding operands in the assembly instruction sequence by using a locality-sensitive hashing algorithm to obtain a second numeric embedded sequence of the assembly instruction sequence;

counting the number of elements in the operand set and the occurrences of each operand, each annotation information type, and printable strings, to obtain a third numeric embedded sequence of the assembly instruction sequence;

and obtaining a numeric coding result of the assembly instruction sequence based on the first numeric embedded sequence, the second numeric embedded sequence and the third numeric embedded sequence.

Further, the prepending the encoding of a [CLS] field to the feature representation sequence $S^{op}$, feeding it into the Transformer neural network model V based on a self-attention mechanism, and taking the output of the Transformer neural network model V at the position corresponding to [CLS] as the static feature representation $v^{op}$ of the malware comprises:

prepending the encoding of the [CLS] field to the feature representation sequence $S^{op}$ and feeding it into the Transformer neural network model V based on a self-attention mechanism;

obtaining the output O of the self-attention layer based on the formalized representation of the feature calculation process of the Transformer neural network model V, wherein the formalized representation of the feature calculation process is $A = \mathrm{softmax}\left(QK^{T}/\sqrt{d_h}\right)$ and $O = A \cdot V$; Q, K and V denote the query matrix, the key matrix and the value matrix respectively, A denotes the attention map containing the degree of association between each field and the [CLS] field, and $d_h$ is the dimension of the feature space;

linearly transforming the output O, and taking the output of the Transformer neural network model V at the position corresponding to [CLS] as the static feature representation $v^{op}$ of the malware.

Further, the inputting the formalized representation $H^{op}$ and the formalized representation $H^{api}$ into a neural network model based on cooperative attention to obtain a static collaborative feature representation and a dynamic collaborative feature representation of the malware comprises:

computing a relation matrix C between the formalized representation $H^{op}$ and the formalized representation $H^{api}$;

computing $U^{op} = \tanh\left(W^{op}H^{op} + (W^{api}H^{api})C\right)$ and $U^{api} = \tanh\left(W^{api}H^{api} + (W^{op}H^{op})C^{T}\right)$ respectively, where $U^{op}$ is the collaborative intermediate representation of the assembly instruction sequence and $U^{api}$ is the collaborative intermediate representation of the API call sequence;

computing the attention weights $\alpha^{op}$ of the assembly functions and the attention weights $\alpha^{api}$ of the API call fragments respectively, where $W_b$, $W^{op}$, $W^{api}$ and the remaining projection weights are linear mapping parameters;

computing the static collaborative feature representation as $\sum_{m=1}^{M} \alpha^{op}_m h^{op}_m$ and the dynamic collaborative feature representation as $\sum_{m=1}^{M} \alpha^{api}_m h^{api}_m$, where $h^{op}_m$ denotes the m-th vector of the formalized representation $H^{op}$, $h^{api}_m$ denotes the m-th vector of the formalized representation $H^{api}$, M denotes the length of the formalized representations $H^{op}$ and $H^{api}$, and $\alpha^{op}_m$ and $\alpha^{api}_m$ denote the m-th elements of the attention weights $\alpha^{op}$ and $\alpha^{api}$.
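The cooperative attention step can be illustrated with a minimal, dependency-free sketch. It follows the two tanh expressions for the collaborative intermediate representations given above, but two parts are assumptions because the text does not spell them out: the relation matrix is taken to be C = tanh((H^api)^T W_b H^op), and the attention weights are taken to be a softmax over a learned projection of each intermediate representation (the hypothetical vectors `p_op` and `p_api`). Matrices are stored column-wise, so `h_op` has shape d x M_op.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def tanh_sum(a, b):
    """Element-wise tanh(a + b) for two same-shaped matrices."""
    return [[math.tanh(x + y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def co_attention(h_op, h_api, w_b, w_op, w_api, p_op, p_api):
    """h_op: d x M_op, h_api: d x M_api; columns are the hidden vectors."""
    # ASSUMPTION: relation matrix C = tanh(H_api^T W_b H_op), shape M_api x M_op.
    c = [[math.tanh(x) for x in row]
         for row in matmul(matmul(transpose(h_api), w_b), h_op)]
    proj_op = matmul(w_op, h_op)      # k x M_op
    proj_api = matmul(w_api, h_api)   # k x M_api
    u_op = tanh_sum(proj_op, matmul(proj_api, c))              # U^op
    u_api = tanh_sum(proj_api, matmul(proj_op, transpose(c)))  # U^api
    # ASSUMPTION: attention weights are a softmax over a learned projection.
    alpha_op = softmax(matmul([p_op], u_op)[0])
    alpha_api = softmax(matmul([p_api], u_api)[0])
    # Collaborative features: attention-weighted sums of the hidden vectors.
    static_co = [sum(a * x for a, x in zip(alpha_op, row)) for row in h_op]
    dynamic_co = [sum(a * x for a, x in zip(alpha_api, row)) for row in h_api]
    return alpha_op, alpha_api, static_co, dynamic_co


identity = [[1.0, 0.0], [0.0, 1.0]]
h_op = [[1.0, 0.0], [0.0, 1.0]]             # two function-level vectors
h_api = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]  # three fragment-level vectors
alpha_op, alpha_api, static_co, dynamic_co = co_attention(
    h_op, h_api, identity, identity, identity, [1.0, 0.0], [0.0, 1.0])
```

The text uses a common length M for both sequences; the sketch also accepts different lengths, which only changes the shape of C.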

Further, the method further comprises:

recording classification influence factors, the classification influence factors comprising: the importance coefficients of assembly instructions and API calls for constructing the assembly function features and the API call subsequence features, as calculated by the Transformer neural network model V based on a self-attention mechanism; and the importance coefficients of assembly functions and API call subsequences for feature construction, as calculated by the neural network model based on cooperative attention;

visualizing the classification influence factors to explain the classification result of the malware.
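As a rough illustration of how recorded importance coefficients could be turned into a hierarchical view, the sketch below ranks functions by their weight and, under each, ranks instructions by their weight, rendering simple text bars. The data layout (`name`, `weight`, `instructions`, `instruction_weights`), the helper names, and the sample values are all hypothetical; the text does not prescribe a rendering format.

```python
def top_k(items, weights, k=3):
    """Return the k (item, weight) pairs with the largest weights."""
    ranked = sorted(zip(items, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def render_bar(label, weight, width=20):
    """Render one importance coefficient in [0, 1] as a text bar."""
    return f"{label:<24} {'#' * round(weight * width)} {weight:.2f}"


def explain(functions, k_functions=2, k_instructions=2):
    """Two-level view: top functions, then top instructions inside each."""
    by_name = {f["name"]: f for f in functions}
    names = [f["name"] for f in functions]
    weights = [f["weight"] for f in functions]
    lines = []
    for name, weight in top_k(names, weights, k_functions):
        lines.append(render_bar(name, weight))
        func = by_name[name]
        for instr, w in top_k(func["instructions"],
                              func["instruction_weights"], k_instructions):
            lines.append("  " + render_bar(instr, w))
    return lines


# Hypothetical attention values, for illustration only.
functions = [
    {"name": "sub_401000", "weight": 0.7,
     "instructions": ["push ebp", "call CreateFileA", "ret"],
     "instruction_weights": [0.1, 0.8, 0.1]},
    {"name": "sub_401200", "weight": 0.3,
     "instructions": ["xor eax, eax", "ret"],
     "instruction_weights": [0.6, 0.4]},
]
report_lines = explain(functions, k_functions=1, k_instructions=2)
for line in report_lines:
    print(line)
```

The same structure would apply on the dynamic side, with API call subsequences in place of functions and API calls in place of instructions.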

A cooperative attention-based malware classification device, the device comprising:

the preprocessing module is used for acquiring an assembly instruction sequence and an API call sequence of the malicious software;

the dynamic and static feature sequence normalization module is used for calculating a feature representation sequence $S^{op}$ of the assembly instruction sequence and a feature representation sequence $S^{api}$ of the API call sequence;

a malware feature representation and classification module for: prepending the encoding of a [CLS] field to the feature representation sequence $S^{op}$ and to the feature representation sequence $S^{api}$, feeding each into a Transformer neural network model V based on a self-attention mechanism, taking the output of the Transformer neural network model V at the position corresponding to [CLS] as the static feature representation $v^{op}$ and the dynamic feature representation $v^{api}$ of the malware, and obtaining, based on the hidden-layer feature sequences of the Transformer neural network model V, a formalized representation $H^{op}$ of the assembly instruction sequence and a formalized representation $H^{api}$ of the API call sequence; inputting the formalized representation $H^{op}$ and the formalized representation $H^{api}$ into a neural network model based on cooperative attention to obtain a static collaborative feature representation and a dynamic collaborative feature representation of the malware; and concatenating the static feature representation $v^{op}$, the dynamic feature representation $v^{api}$, the static collaborative feature representation and the dynamic collaborative feature representation, and then classifying, to obtain a classification result of the malware.

An electronic device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of the above.

A computer readable storage medium storing a computer program which, when executed by a computer, implements the method of any one of the preceding claims.

Compared with the prior art, the invention has at least the following advantages:

1) The invention combines the static and dynamic information of a sample for fine-grained encoding, retaining both semantic information and parameter information. In addition, it extracts features and classifies using hierarchical neural network models based on self-attention and cooperative attention; using attention mechanisms during feature extraction aggregates the encoded information while also explaining the classification result more intuitively.

2) The invention provides a multi-source fusion feature encoding algorithm for malware, which comprehensively uses multi-source information such as opcodes, operands, APIs, and API parameters, thereby improving the expressive power of the features.

3) The invention provides a sequence feature representation algorithm and a malware feature representation and classification algorithm. Used together, they map dynamic and static feature sequences of unbounded, variable length into a high-dimensional feature representation space layer by layer, and associate the intermediate-layer static function feature representations with the dynamic API call fragment feature representations, giving a more general and efficient representation and classification capability.

4) The invention provides a classification decision importance visualization algorithm. During the computation of each malware feature representation, it records the importance coefficients of assembly functions/API call subsequences for feature construction, calculated by the neural network model based on cooperative attention, and the importance coefficients of assembly instructions/API calls for constructing the assembly function/API call subsequence features. It then presents these coefficients in a hierarchical visualization, giving security analysts an intuitive explanation of the predicted classification probability.

Drawings

FIG. 1 is a flow chart of the proposed collaborative attention based malware classification method and system.

FIG. 2 is an example of operand set abstraction in the proposed multi-source fusion feature encoding method for malware.

FIG. 3 is an example of API call parameter set abstraction in the proposed multi-source fusion feature encoding method for malware.

FIG. 4 is a visualization of assembly instruction importance for the malware classification result of Example 2.

FIG. 5 is a visualization of API call importance for the malware classification result of Example 2.

Detailed Description

The present invention will be described in further detail with reference to specific examples and illustrations.

The method and device for classifying malware based on cooperative attention are suitable for fusing multi-source dynamic and static features of malware and extracting a vectorized representation of the malware, while presenting the key elements of the classification in a hierarchical visualization to give security analysts an intuitive explanation of the predicted classification probability.

As shown in FIG. 1, the cooperative attention-based malware classification method of the invention comprises the following steps:

Step 1: acquire the assembly instruction sequence and the API call sequence of the malware.

The invention collects malware instances of different malware families; for all malware instances, an assembly instruction sequence set is obtained with disassembly software and an API call sequence set is obtained through simulated execution in a sandbox.

In a preferred embodiment of the present invention, in order to obtain a more complete feature representation of the malware, the dynamic and static information of malware from different families must be collected as comprehensively as possible. Specifically, the malware is disassembled with IDA to obtain disassembled code containing information such as function boundaries, system APIs, and readable string variables in the data segment; the dynamic behavior of the malware is monitored with Cuckoo Sandbox, and a parallelized processing function is designed to extract the API call sequence from the Cuckoo analysis report.
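A parallelized extraction function of this kind might look like the following sketch. It assumes the JSON report layout commonly produced by Cuckoo Sandbox 2.x, in which per-process call logs sit under `behavior.processes[].calls[]` with `api`, `category`, `arguments` and `time` fields; that layout, the decision to merge all processes into one time-ordered sequence, and the function names are assumptions rather than details from the text.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def extract_api_calls(report):
    """Flatten the per-process call logs of one report into one time-ordered list."""
    calls = []
    for process in report.get("behavior", {}).get("processes", []):
        for call in process.get("calls", []):
            calls.append({
                "time": call.get("time", 0.0),
                "api": call["api"],
                "category": call.get("category", ""),
                "arguments": call.get("arguments", {}),
            })
    # ASSUMPTION: calls from all processes are merged by timestamp.
    calls.sort(key=lambda c: c["time"])
    return calls


def load_and_extract(path):
    """Read one report.json file and return its API call sequence."""
    with open(path, encoding="utf-8") as handle:
        return extract_api_calls(json.load(handle))


def extract_many(paths, workers=4):
    """Process several reports concurrently, preserving the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_and_extract, paths))
```

A thread pool keeps the sketch simple; for large reports, where JSON parsing dominates, a process pool would likely scale better.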

Step 2: calculate the feature representation sequence $S^{op}$ of the assembly instruction sequence and the feature representation sequence $S^{api}$ of the API call sequence.

This step is carried out with the multi-source fusion feature encoding algorithm and the sequence feature representation algorithm.

The multi-source fusion feature encoding algorithm applies feature extraction functions to the assembly instructions and API calls of a malware instance to extract numeric embedded sequences in turn. Specifically, for each assembly instruction or API call of each malware instance, field occurrence frequencies are counted to obtain an occurrence-frequency encoding of important feature fields (the third numeric embedded sequence); opcodes and API function names are encoded with a word embedding algorithm to capture semantic similarity, giving word embedding representations of the opcodes/API call names (the first numeric embedded sequence); and operands and API function parameters are encoded with a locality-sensitive hashing algorithm to capture similarity between sets, giving encodings of the operands in assembly instructions and of the parameter names and parameter values in API calls (the second numeric embedded sequence). Finally, the numeric encoding result of the assembly instruction sequence is obtained based on the first, second and third numeric embedded sequences.

In a preferred embodiment of the present invention, the opcode/API call name word embedding technique in the multi-source fusion feature encoding algorithm is as follows. Opcodes are encoded with word2vec, a word embedding method commonly used in natural language processing and based on a shallow neural network. It characterizes the semantic and syntactic information of words by learning their context in text: it learns a mapping from the original word space to an embedding space in which semantically similar words lie closer together. Assembly function instruction sequences/API call subsequences within a fixed-length sliding window, taken from different malware, are used as the corpus to train word2vec models, yielding word embedding representations of the opcodes and of the API call names respectively.
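The corpus construction described here can be sketched as follows. The window length, the stride, and the helper names are illustrative assumptions; the resulting list of token windows is the kind of input a word2vec trainer expects, but no particular word2vec library is called in this sketch.

```python
def sliding_windows(tokens, window=5, stride=1):
    """Cut one opcode or API-name sequence into fixed-length windows."""
    if not tokens:
        return []
    if len(tokens) <= window:
        return [list(tokens)]  # short sequences are kept as a single sentence
    return [list(tokens[i:i + window])
            for i in range(0, len(tokens) - window + 1, stride)]


def build_corpus(sequences, window=5, stride=1):
    """Collect the windows of every sequence into one training corpus."""
    corpus = []
    for sequence in sequences:
        corpus.extend(sliding_windows(sequence, window, stride))
    return corpus


# Hypothetical opcode sequences from two functions.
opcode_sequences = [
    ["push", "mov", "sub", "call", "add", "ret"],
    ["xor", "ret"],
]
corpus = build_corpus(opcode_sequences, window=4)
```

Each window plays the role of a sentence and each opcode or API name the role of a word.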

In a preferred embodiment of the present invention, the operand/API parameter locality-sensitive hashing technique in the multi-source fusion feature encoding algorithm is as follows. The operands in assembly instructions and the parameter names and parameter values in API calls are each encoded with a locality-sensitive hashing algorithm. Locality-sensitive hashing maps sample sets that are similar in the original space to values that remain similar in a specific target space, while samples that are dissimilar in the original space remain dissimilar with high probability after hashing. Specifically, the locality-sensitive hashing algorithm Simhash is applied to operand encoding as follows. First, the operands and annotation information are normalized. Some fields make the Simhash encoding overly specific and hinder feature learning, for example immediate values in operands and the many variable names that are automatically named after addresses in the annotation information; these fields are normalized. The operands are classified and represented abstractly using predefined fields: four operand types, REG, MEM, CONST and MARK (function names, jump addresses, structure pointer addresses, etc.), and annotation information types are defined, and the original operands are reconstructed accordingly. Simhash is then computed, using MD5 as the hash algorithm, and the computed hash values are summed according to the field weights to form a weighted digit string. Finally, the weighted digit string is reduced: each position is set to 1 if its value is greater than 0 and to 0 if its value is less than 0.

The process of applying Simhash to API call parameter encoding is similar, except that parameter names and parameter values are processed separately and the abstract fields for parameter values differ. API parameter values are divided into three categories: numbers, address values, and strings. Numbers and address values are represented abstractly as CONST and ADDR. String parameter values may contain key information such as IP addresses, URLs, file paths, DLL names, and registry values, and are analyzed in more detail: if an IP address or URL is detected, the IP or URL abstract representation is used; if a file path, DLL name, or registry value is detected, it is split into segments that are added to the parameter value set. The processed parameter value set is hashed with Simhash, and the result is used as the API parameter value encoding.
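The operand-side procedure can be sketched as below: operands are first abstracted to REG, MEM, CONST or MARK, then a Simhash fingerprint is computed with MD5. Several details are assumptions made for illustration: the register list and the regular expression used for abstraction, the use of the low 64 bits of the MD5 digest, field weights supplied by the caller, and mapping a position whose sum is exactly 0 to bit 0 (the text only covers values above and below 0).

```python
import hashlib
import re

# ASSUMPTION: a small x86 register list, enough for illustration.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}
# Immediates such as 10, 0x1F or 0FFh (hex with h suffix needs a leading digit).
CONST_PATTERN = re.compile(r"-?(0x[0-9a-f]+|[0-9][0-9a-f]*h|\d+)")


def abstract_operand(operand):
    """Map a raw operand to one of the predefined abstract fields."""
    op = operand.strip().lower()
    if op in REGISTERS:
        return "REG"
    if "[" in op:
        return "MEM"
    if CONST_PATTERN.fullmatch(op):
        return "CONST"
    return "MARK"  # function names, jump targets, structure pointers, ...


def simhash(weighted_fields, bits=64):
    """Simhash over {field: weight}; returns a list of 0/1 bits."""
    if not 1 <= bits <= 128:
        raise ValueError("MD5 provides at most 128 bits")
    totals = [0] * bits
    for field, weight in weighted_fields.items():
        digest = int.from_bytes(hashlib.md5(field.encode("utf-8")).digest(), "big")
        for i in range(bits):
            # A set bit votes +weight, a clear bit votes -weight.
            totals[i] += weight if (digest >> i) & 1 else -weight
    # Reduce the weighted digit string to bits (0 is treated as "not set").
    return [1 if total > 0 else 0 for total in totals]


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return sum(x != y for x, y in zip(a, b))


operands = ["eax", "[ebp+8]", "0x10", "sub_401000"]
fields = {}
for operand in operands:
    kind = abstract_operand(operand)
    fields[kind] = fields.get(kind, 0) + 1  # ASSUMPTION: weight = occurrence count
fingerprint = simhash(fields)
```

Fingerprints of similar operand sets should differ in only a few bits, which `hamming` makes easy to check.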

In a preferred embodiment of the present invention, the statistical feature encoding technique in the multi-source fusion feature encoding algorithm is as follows. For each assembly instruction, the opcode is classified into seven types (data transfer instructions, arithmetic instructions, logical instructions, string instructions, program transfer instructions, pseudo-instructions, and processor control instructions) plus others, and the opcode type is encoded with one-hot encoding; the number of elements in the operand set and the occurrences of each operand, each annotation information type, and printable strings are counted. For each API call, the 312 API calls monitored by Cuckoo Sandbox are divided into 17 classes and the class is encoded with one-hot encoding; the value of each type of API parameter and the number of occurrences of 'MZ' are counted, and the number of elements in the normalized parameter set and the average string length are calculated, so as to summarize the set as a whole.
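For the assembly side, the statistical encoding could be sketched as a one-hot opcode class followed by a few counts. The mnemonic-to-class table below is a small hypothetical sample, and the exact layout of the count fields (distinct operand types, per-type counts, annotation count, printable-string count) is one possible reading of the description, not a confirmed specification.

```python
OPCODE_CLASSES = [
    "data_transfer", "arithmetic", "logic", "string",
    "program_transfer", "pseudo", "processor_control", "other",
]
OPERAND_TYPES = ["REG", "MEM", "CONST", "MARK"]

# ASSUMPTION: a tiny illustrative mapping; a real table would cover the full ISA.
OPCODE_TO_CLASS = {
    "mov": "data_transfer", "push": "data_transfer", "pop": "data_transfer",
    "add": "arithmetic", "sub": "arithmetic",
    "and": "logic", "xor": "logic",
    "movsb": "string",
    "jmp": "program_transfer", "call": "program_transfer", "ret": "program_transfer",
    "db": "pseudo",
    "nop": "processor_control", "hlt": "processor_control",
}


def one_hot(label, labels):
    return [1 if label == candidate else 0 for candidate in labels]


def encode_instruction_stats(opcode, operand_types, n_annotations=0, n_strings=0):
    """One-hot opcode class followed by operand, annotation and string counts."""
    opcode_class = OPCODE_TO_CLASS.get(opcode.lower(), "other")
    vector = one_hot(opcode_class, OPCODE_CLASSES)
    vector.append(len(set(operand_types)))                     # operand set size
    vector.extend(operand_types.count(t) for t in OPERAND_TYPES)
    vector.append(n_annotations)                               # annotation fields
    vector.append(n_strings)                                   # printable strings
    return vector


stats = encode_instruction_stats("mov", ["REG", "MEM"])
```

The API side would follow the same pattern with a 17-way one-hot class and the parameter statistics listed above.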

The sequence feature representation algorithm uses a deep neural network model based on a self-attention mechanism for feature extraction.

In a preferred embodiment of the present invention, the hierarchical feature representation technique based on the self-attention neural network model in the sequence feature representation algorithm is as follows: for the digitized instruction encoding sequence of each assembly function/API call fragment, a Transformer based on a self-attention mechanism is used to extract the feature representation at the function level/API call fragment level.

The following is a formalized representation of the instruction encoding sequence of the assembly function/the API call encoding sequence of the API call fragment:

$W_m = (w_{1m}, w_{2m}, \ldots, w_{Nm})$

where $W_m$ is the digitized sequence, after position encoding is added, of the instruction encoding sequence of the m-th assembly function of the malware or of the API call encoding sequence of the m-th API call fragment; $w_{1m}$ is the numeric embedding of the first assembly instruction/API call in that assembly function instruction encoding sequence/API call fragment; N is a fixed length, with longer sequences truncated and shorter ones padded with zero vectors; and $w_{Nm}$ is the numeric embedding of the N-th assembly instruction/API call in that sequence/API call fragment. The encoding $w_{0m}$ of the [CLS] field is prepended to $W_m$, which is then fed into a Transformer neural network model based on a self-attention mechanism.
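The fixed-length handling and the [CLS] prefix can be shown with a short sketch. The function names and the toy dimensions are illustrative, and the [CLS] encoding is passed in as a plain vector since the text does not say how it is initialised; position encoding is omitted here for brevity.

```python
def pad_or_truncate(vectors, n, dim):
    """Cut a sequence of embeddings to length n, or pad it with zero vectors."""
    clipped = [list(vector) for vector in vectors[:n]]
    clipped.extend([0.0] * dim for _ in range(n - len(clipped)))
    return clipped


def prepend_cls(vectors, cls_vector):
    """Place the [CLS] encoding at position 0 of the sequence."""
    return [list(cls_vector)] + vectors


# Toy example: two 2-dimensional instruction embeddings, fixed length N = 3.
embeddings = [[0.1, 0.2], [0.3, 0.4]]
model_input = prepend_cls(pad_or_truncate(embeddings, n=3, dim=2), [1.0, 1.0])
```

The model input therefore always has N + 1 rows, with row 0 reserved for [CLS].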

The following is a formalized representation of the feature calculation process:

$A_m = \mathrm{softmax}\left(Q_m K_m^{T}/\sqrt{d_h}\right)$

$O_m = A_m \cdot V_m$

where $Q_m$, $K_m$ and $V_m$ are the query, key and value matrices respectively, h is the number of self-attention heads, N is the truncated sequence length, and $d_h$ is the dimension of the feature space. $A_m$ is the attention map, which contains the degree of association between each field and the [CLS] field. $O_m$ is the output of the self-attention layer; after a linear transformation, the model output at the position corresponding to [CLS] is taken as the feature representation $s_m$ of the assembly function/API call fragment.
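A single-head version of this computation can be sketched without any dependencies. The sketch assumes the standard scaled dot-product form, in which the attention map is a row-wise softmax of Q K^T divided by the square root of d_h; the projection matrices `w_q`, `w_k`, `w_v` and the toy inputs are hypothetical, and multiple heads and the final linear transformation are left out.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """x holds one row per field, with the [CLS] encoding in row 0."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_h = len(q[0])
    scores = matmul(q, transpose(k))
    # ASSUMPTION: standard scaled dot-product attention.
    attention = [softmax([s / math.sqrt(d_h) for s in row]) for row in scores]
    output = matmul(attention, v)
    return attention, output


identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0],   # row 0 plays the role of the [CLS] encoding
          [0.0, 1.0],
          [1.0, 1.0]]
attention, output = self_attention(tokens, identity, identity, identity)
cls_attention = attention[0]  # association of every field with [CLS]
cls_output = output[0]        # would be linearly transformed into s_m
```

Row 0 of the attention map is the part that would later be recorded as instruction-level or call-level importance.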

Step 3: prepend the encoding of a [CLS] field to the feature representation sequence $S^{op}$ and to the feature representation sequence $S^{api}$, feed each into a Transformer neural network model V based on a self-attention mechanism, take the output of the Transformer neural network model V at the position corresponding to [CLS] as the static feature representation $v^{op}$ and the dynamic feature representation $v^{api}$ of the malware, and obtain, based on the hidden-layer feature sequences of the Transformer neural network model V, the formalized representation $H^{op}$ of the assembly instruction sequence and the formalized representation $H^{api}$ of the API call sequence.

The malware feature representation and classification algorithm first performs layer-by-layer feature extraction with deep neural network models based on a self-attention mechanism: for the dynamic and static feature sequences of each malware instance, it computes dynamic and static feature representations at the function level/API call fragment level and then at the malware level. It then uses a neural network model based on a cooperative attention mechanism to extract the semantic association between the function-level static feature representations of the disassembled code and the API-call-fragment-level dynamic feature representations, and computes malware-level dynamic and static collaborative feature representations that fuse this cross information. Finally, the dynamic and static feature representations of the malware are concatenated and used as the input of a machine learning classifier to calculate the family classification probability of the malware.
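The final concatenate-and-classify step can be illustrated with a minimal sketch. The text only says that a machine learning classifier is used, so the single linear layer with softmax below is a stand-in chosen for brevity, and the weights, bias and toy feature vectors are hypothetical.

```python
import math


def concat(*vectors):
    """Concatenate feature vectors into one flat vector."""
    joined = []
    for vector in vectors:
        joined.extend(vector)
    return joined


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, bias):
    """ASSUMPTION: a linear layer plus softmax stands in for the classifier."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Toy 1-dimensional versions of the four malware-level representations.
v_op, v_api = [1.0], [0.0]
static_co, dynamic_co = [0.5], [0.5]
features = concat(v_op, v_api, static_co, dynamic_co)
family_probs = classify(features,
                        weights=[[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
                        bias=[0.0, 0.0])
```

The output is one probability per malware family, which is the family distribution the method reports.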

In a preferred embodiment of the present invention, the hierarchical feature representation technique based on the self-attention neural network model in the malware feature representation and classification algorithm is as follows:

for each computed assembly function feature representation sequence/API call fragment feature representation sequence, a Transformer based on a self-attention mechanism is used to extract the malware-level dynamic and static feature representations.

The following is a formalized representation of the assembly function feature representation sequence/API call fragment feature representation sequence:

S = (s_1, s_2, …, s_M)

wherein S is the numerical feature representation sequence of the static assembly functions/dynamic call fragments of the malware after position codes are added. The encoding s_0 of the [CLS] field is prepended to S, which is then fed into a Transformer neural network model based on a self-attention mechanism.

A = softmax(Q·K^T / √d_h)

O = A·V

wherein Q, K, and V are the query, key, and value matrices, respectively, h is the number of self-attention heads, N is the truncated sequence length, and d_h is the dimension of the feature space. A is the attention map, which contains the degree of association between each field and the [CLS] field. O is the output of the self-attention layer; it is linearly transformed, and the output at the position corresponding to [CLS] is taken as the static feature representation v^op or the dynamic feature representation v^api of the malware.
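The [CLS]-based extraction described above can be illustrated with a minimal NumPy sketch. It assumes a single attention head, the standard scaled dot-product form of the attention map, randomly initialized weights, and no layer normalization or feed-forward sublayer; the function name `cls_self_attention` and the matrices `Wq`, `Wk`, `Wv`, `Wo` are hypothetical and do not come from the patent text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cls_self_attention(S, s0, Wq, Wk, Wv, Wo):
    """Single-head self-attention over [s0; S]; returns (v, H, A).

    S  : (M, d) feature representation sequence (position codes already added)
    s0 : (d,)   encoding of the [CLS] field
    """
    X = np.vstack([s0[None, :], S])        # prepend [CLS] -> (M + 1, d)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # query, key, value matrices
    d_h = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_h))    # attention map (assumed standard form)
    O = A @ V                              # O = A·V
    out = O @ Wo                           # linear transformation of O
    v = out[0]                             # output at the [CLS] position
    H = out[1:]                            # hidden-layer features of the M fields
    return v, H, A

# Toy example with random inputs (illustrative only).
rng = np.random.default_rng(0)
M, d, d_h = 5, 8, 4
S = rng.normal(size=(M, d))
s0 = rng.normal(size=d)
Wq, Wk, Wv = (rng.normal(size=(d, d_h)) for _ in range(3))
Wo = rng.normal(size=(d_h, d))
v_op, H_op, A = cls_self_attention(S, s0, Wq, Wk, Wv, Wo)
```

Row 0 of `A` holds the attention that the [CLS] position pays to every field, which is the quantity later reused as a classification influence factor.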

In a preferred embodiment of the present invention, the feature representation technique based on the collaborative attention neural network model described in the malware feature representation and classification algorithm: given each assembly function feature representation sequence/API call fragment feature representation sequence of a certain malware, a neural network model based on a collaborative attention mechanism is used to extract a dynamic and static collaborative feature representation of the malware level.

The following is a formalized representation of the assembly function feature representation sequence and the API call fragment feature representation sequence:

wherein H^op and H^api are the hidden-layer feature sequences obtained by passing the position-encoded numerical feature representation sequences of the malware's static assembly functions/dynamic call fragments through the Transformer model, and the M-th element of each sequence is the hidden-layer feature representation of the M-th static assembly function/dynamic call fragment after its position-encoded numerical feature representation has passed through the Transformer model. A neural network model based on a collaborative attention mechanism is used to compute the association matrix between H^op and H^api.

The following is a formalized representation of the correlation matrix calculation process:

U^op = tanh(W^op H^op + (W^api H^api) C)

U^api = tanh(W^api H^api + (W^op H^op) C^T)

Wherein W is _b 、W ^op 、W ^api 、Is a linear mapping parameter, C is used as a relationship matrix to calculate the correlation coefficient between the assembly function and the API call fragment. Alpha ^op 、α ^api Attention weights for the assembly function and the API call fragment, respectively. Based on the weights, a dynamic and static collaborative feature representation of the malware level is calculated.

The following is the formalized form of the calculation process of the dynamic and static collaborative feature representations:

In a preferred embodiment of the present invention, the machine learning based malware classification technique in the malware feature representation and classification algorithm is as follows: given the dynamic and static feature representations and the dynamic and static collaborative feature representations of a malware sample, they are weighted, summed, and concatenated, and the malware is then classified using a multi-layer perceptron model and a traditional machine learning model.
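The final classification stage can be sketched as below. This is only an illustration under stated assumptions: the four representations are simply concatenated (the weighting scheme mentioned in the text is not specified and is omitted), the classifier is a one-hidden-layer perceptron with ReLU and softmax, and all weights are random. The function name `classify` and the layer sizes are hypothetical.

```python
import numpy as np

def classify(v_op, v_api, v_hat_op, v_hat_api, W1, b1, W2, b2):
    """Concatenate the four representations and return family probabilities."""
    x = np.concatenate([v_op, v_api, v_hat_op, v_hat_api])
    h = np.maximum(0.0, W1 @ x + b1)        # hidden layer with ReLU (assumed)
    logits = W2 @ h + b2
    e = np.exp(logits - logits.max())       # numerically stable softmax
    return e / e.sum()

# Toy example: 8-dimensional representations, 61 families as in Dataset-I.
rng = np.random.default_rng(2)
d, hidden, n_families = 8, 16, 61
vectors = [rng.normal(size=d) for _ in range(4)]
W1, b1 = rng.normal(size=(hidden, 4 * d)), np.zeros(hidden)
W2, b2 = rng.normal(size=(n_families, hidden)), np.zeros(n_families)
probs = classify(*vectors, W1, b1, W2, b2)
predicted_family = int(np.argmax(probs))
```

A traditional machine learning model, as also mentioned in the text, could consume the same concatenated vector `x` in place of the perceptron.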

In a preferred embodiment of the present invention, the invention further comprises a classification decision importance visualization method. In this method, during construction of the malware feature representation, the classification influence factors of the dynamic and static features of the malware instance are recorded and visualized to interpret the classification results. Specifically, the importance coefficients of the assembly functions and API call subsequences for feature construction, as calculated by the neural network model based on cooperative attention, are recorded; the importance coefficients of the assembly instructions and API calls for the construction of assembly function features and API call subsequence features, as calculated by the self-attention neural network model, are recorded; and a hierarchical visual presentation is produced from these importance coefficients, giving security analysts an intuitive interpretation of the predicted classification probabilities.

In a preferred embodiment of the present invention, the collaborative attention-based importance calculation technique in the classification decision importance visualization algorithm is as follows: given the dynamic and static collaborative feature representations of a malware sample, the attention weights are recorded as the classification influence factors of the assembly functions/API fragments, normalized to the [0,1] interval, and visualized with color depth representing the degree of importance.

In a preferred embodiment of the present invention, the self-attention-based hierarchical importance calculation technique in the classification decision importance visualization algorithm is as follows: given the dynamic and static feature representations of a malware sample, the attention weights are recorded as the classification influence factors of the assembly functions/API fragments, normalized to the [0,1] interval, and visualized with color depth representing the degree of importance. Given the feature representation sequences of the assembly functions/API call fragments of a malware sample, the attention weights in the calculation process are recorded as the classification influence factors of the assembly instructions/API calls, normalized to the [0,1] interval, and visualized with color shade representing the degree of importance.
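A minimal sketch of the normalization and color mapping follows. The text does not state which normalization is used, so min-max scaling is an assumption here, as is the white-to-red color ramp; `normalize_01` and `to_red_shade` are hypothetical helper names.

```python
import numpy as np

def normalize_01(weights):
    """Min-max scale attention weights into [0, 1] (assumed normalization)."""
    w = np.asarray(weights, dtype=float)
    span = w.max() - w.min()
    if span == 0.0:
        return np.zeros_like(w)             # all weights equal: no contrast
    return (w - w.min()) / span

def to_red_shade(score):
    """Map a score in [0, 1] to an RGB triple: 0 -> white, 1 -> full red."""
    fade = int(round(255 * (1.0 - score)))
    return (255, fade, fade)

# Hypothetical attention weights for four assembly functions.
alpha = [0.05, 0.10, 0.60, 0.25]
scores = normalize_01(alpha)
colors = [to_red_shade(s) for s in scores]
```

With these weights, the third function receives the deepest shade and the first receives none, so an analyst's eye is drawn to the most influential function.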

Figs. 2 and 3 are example diagrams of the multi-source fused feature encoding method for malware.

Example 1: Malware family classification using the multi-source fused feature encoding algorithm and the malware feature representation and classification algorithm.

A self-collected dataset, Dataset-I, and two public datasets, Big 2015 and Catak 2019, are taken as example datasets. Dataset-I contains 20070 malware samples from 61 malware families, which are disassembled using IDA and whose dynamic API call sequences are collected using Cuckoo Sandbox; the Big 2015 dataset contains opcode sequences for a total of 10809 malware samples from 9 malware families; the Catak 2019 dataset contains API call sequences for a total of 7107 malware samples from 8 malware families.

1) First, three experiments are carried out, one on each of the three datasets; for Dataset-I, the dynamic and static information of the malware is obtained;

2) From the dynamic and static information obtained in step 1), three-level static information (malware / assembly function / assembly instruction) and three-level dynamic information (malware / API call fragment / API call) are extracted;

3) Word2vec word embedding models are trained separately on the assembly instructions and on the API calls obtained in step 2);

4) The word2vec word embedding models obtained in step 3) are used to numerically encode the opcodes in the assembly instructions and the API names in the API calls, respectively;

5) The locality-sensitive hashing algorithm Simhash is used to numerically encode the abstracted operand and annotation information set in each assembly instruction and the abstracted API parameter name and value set in each API call, yielding statistical features;

6) The numerical codes and statistical features obtained in steps 4) and 5) are concatenated to form a numerical coding sequence for the assembly instructions in each function and for the API calls in each API fragment;

7) For the numerical coding sequences of the assembly instructions in each function and of the API calls in each API fragment obtained in step 6), the self-attention-based neural network model is used to calculate the function-level/API-fragment-level feature representation sequences; likewise, the self-attention-based neural network model is applied to the function-level/API-fragment-level feature representation sequences to calculate the malware-level static/dynamic feature representations; on the basis of the function-level/API-fragment-level feature representation sequences, the neural network model based on collaborative attention is used to calculate the malware-level static/dynamic collaborative feature representations; the malware-level static/dynamic feature representations and the static/dynamic collaborative feature representations are weighted, summed, and concatenated to obtain the feature representation of the malware;

8) A malware classification model is trained using the malware feature representations obtained in step 7). For Dataset-I, a classification model for 61 malware families is built; for the Big 2015 dataset, a classification model for 9 malware families is built; and for the Catak 2019 dataset, a classification model for 8 malware families is built.
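Steps 4) to 6) can be illustrated with a small sketch that concatenates an opcode embedding with a SimHash fingerprint of the abstracted operands. The embedding table below is a hand-written stand-in for a trained word2vec model, the 64-bit fingerprint width and the use of MD5 as the per-token hash are assumptions, and the names `simhash_bits` and `encode_instruction` are hypothetical.

```python
import hashlib
import numpy as np

def simhash_bits(tokens, n_bits=64):
    """SimHash of a token set, returned as a 0/1 vector of length n_bits."""
    counts = np.zeros(n_bits, dtype=int)
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: n_bits // 8], "big")
        for i in range(n_bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    return (counts > 0).astype(np.float32)

def encode_instruction(opcode, operand_tokens, embedding, n_bits=64):
    """Concatenate the opcode embedding with the SimHash of the operand set."""
    return np.concatenate([embedding[opcode],
                           simhash_bits(operand_tokens, n_bits)])

# Stand-in for a trained word2vec lookup table (values are made up).
embedding = {
    "mov": np.array([0.1, -0.2, 0.3, 0.0], dtype=np.float32),
    "call": np.array([0.5, 0.4, -0.1, 0.2], dtype=np.float32),
}
code = encode_instruction("mov", ["REG", "MEM"], embedding)
same = encode_instruction("mov", ["REG", "MEM"], embedding)
```

Because SimHash is locality-sensitive, operand sets that share most tokens tend to produce fingerprints that differ in few bits, which is what lets the fingerprint act as a statistical feature.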

The results of the method of the invention are compared with those of other methods: the baseline methods (MalConv, CNN+SVM, BiLSTM) were trained on the training set of Dataset-I and tested on the held-out test set, and the classification accuracy (%) and F1 value (%) are reported.

TABLE 1 Accuracy (%) and F1 value (%) of malware classification on Dataset-I for the method of the invention and other methods

Evaluation index	The method of the invention	MalConv	CNN+SVM	BiLSTM
Accuracy	93.47	90.26	77.73	84.46
F1 value	93.22	92.25	77.81	85.34
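For reference, the two evaluation indices can be computed as in the sketch below. The text does not say how the F1 value is averaged across families, so macro averaging is an assumption here; the labels in the toy example are made up and unrelated to the reported results.

```python
def accuracy(y_true, y_pred):
    """Percentage of samples whose predicted family equals the true family."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-family F1 scores, in percent (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return 100.0 * sum(f1_scores) / len(f1_scores)

# Toy labels for two hypothetical families "a" and "b".
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

In this toy case one of four samples is misclassified, so the accuracy is 75%, while the macro F1 averages 2/3 for family "a" and 4/5 for family "b".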

The results of the method of the invention are compared with those of other methods: the baseline methods (MalConv, MalCSV, CNN+BiLSTM) were trained on the training set of the Big 2015 dataset and tested on the held-out test set, and the classification accuracy (%) and F1 value (%) are reported.

TABLE 2 Accuracy (%) and F1 value (%) of malware classification on the Big 2015 dataset for the method of the invention and other methods

Evaluation index	The method of the invention	MalConv	MalCSV	CNN+BiLSTM
Accuracy	98.75	96.41	97.72	98.20
F1 value	98.75	88.94	97.91	96.05

The results of the method of the invention are compared with those of other methods: the baseline methods (LSTM, Transformer, GRU) were trained on the training set of the Catak 2019 dataset and tested on the held-out test set, and the classification accuracy (%) and F1 value (%) are reported.

TABLE 3 Accuracy (%) and F1 value (%) of malware classification on the Catak 2019 dataset for the method of the invention and other methods

Evaluation index	The method of the invention	LSTM	Transformer	GRU
Accuracy	61.10	47	41	55
F1 value	62.15	47	56.89	55

Example 2: Interpreting the classification probabilities of malware samples using the classification decision importance visualization algorithm.

A case study of one malicious sample from the malware family neshata in Dataset-I of Example 1 illustrates how the method of the invention interprets its malware classification probability predictions.

1) Given the dynamic and static collaborative feature representation of this malware, its collaborative attention weights and its self-attention weights with respect to the [CLS] field are recorded as the classification influence factors of the assembly functions/API fragments. They are weighted, normalized to the [0,1] interval, and visualized with shades of red representing the degree of importance.

2) Given the feature representation sequences of the assembly functions/API call fragments of this malware, the attention weights in the calculation process are recorded as the classification influence factors of the assembly instructions/API calls, normalized to the [0,1] interval, multiplied by the classification influence factor of the corresponding assembly function/API call fragment, and visualized with shades of yellow representing the degree of importance.
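The two-level scoring in steps 1) and 2) can be sketched as follows. The sketch assumes that the function-level weights passed in have already been combined from the collaborative attention and self-attention weights (the combination weights are not given in the text) and that normalization is min-max scaling; `normalize_01` and `hierarchical_importance` are hypothetical names, and the numbers are made up.

```python
import numpy as np

def normalize_01(weights):
    """Min-max scale weights into [0, 1] (assumed normalization)."""
    w = np.asarray(weights, dtype=float)
    span = w.max() - w.min()
    if span == 0.0:
        return np.zeros_like(w)
    return (w - w.min()) / span

def hierarchical_importance(function_weights, instruction_weights):
    """Return function-level scores and instruction-level scores.

    Each instruction score is its normalized attention weight multiplied by
    the score of the function (or API fragment) that contains it.
    """
    func_scores = normalize_01(function_weights)
    inst_scores = [normalize_01(w) * f
                   for f, w in zip(func_scores, instruction_weights)]
    return func_scores, inst_scores

# Two hypothetical functions with two and three instructions respectively.
func_w = [0.2, 0.8]
inst_w = [[0.1, 0.3], [0.5, 0.25, 0.25]]
func_scores, inst_scores = hierarchical_importance(func_w, inst_w)
```

Multiplying by the parent score means that an instruction inside an unimportant function stays pale even if it dominates that function's own attention distribution.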

The results in Table 1, Table 2, Table 3, Fig. 4, and Fig. 5 show the superiority of the cooperative attention-based malware classification method proposed by the present invention.

For each malware instance to be classified, the invention uses statistical features, locality-sensitive hashing, and word embedding methods to describe the semantic and structural similarity of the disassembled malware code and the API traces, uses a hierarchical collaborative attention network model to extract a vectorized representation of the malware for family classification, and intuitively interprets the classification results through the attention maps computed at each layer. First, for malware instances of different malware families, the assembly instruction sequences are obtained with disassembly software and the API call sequences are obtained by simulated execution in a sandbox; then, each assembly instruction or API call is converted into a numerical coding sequence using the designed multi-source fused feature encoding algorithm; then, the feature representation of each malware instance is constructed and its family distribution probability is calculated using the designed malware feature representation and classification algorithm; finally, the designed classification decision importance visualization algorithm records and visualizes the classification influence factors of the dynamic and static features of the malware instance to interpret the classification results.

Based on the same inventive concept, another embodiment of the present invention provides a cooperative attention-based malware classification system, comprising:

a malware feature representation and classification module, configured to: prepend a [CLS] field encoding to the feature representation sequence S^op and to the feature representation sequence S^api, and feed each into a Transformer neural network model V based on a self-attention mechanism; take the output of the Transformer neural network model V at the position corresponding to [CLS] as the static feature representation v^op and the dynamic feature representation v^api of the malware, and obtain the formalized representation H^op of the assembly instruction sequence and the formalized representation H^api of the API call sequence from the hidden-layer feature sequences of the Transformer neural network model V; input the formalized representation H^op and the formalized representation H^api into a neural network model based on cooperative attention to obtain a static collaborative feature representation and a dynamic collaborative feature representation of the malware; and concatenate the static feature representation v^op, the dynamic feature representation v^api, the static collaborative feature representation, and the dynamic collaborative feature representation and then classify them to obtain a classification result of the malware.

In a preferred embodiment of the present invention, the system further comprises a classification decision importance visualization module, configured to record and visualize the classification influence factors of the dynamic and static features of a malware instance during construction of the malware feature representation, so as to interpret the classification results.

Wherein the specific implementation of each module is referred to the previous description of the method of the present invention.

Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smart phone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.

Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, implements the steps of the inventive method.

The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art may modify or substitute the technical solution of the present invention without departing from the spirit and scope of the present invention, and the protection scope of the present invention shall be defined by the claims.

Claims

1. A method of collaborative attention-based malware classification, the method comprising:

calculating a feature representation sequence S^op of the assembly instruction sequence and a feature representation sequence S^api of the API call sequence; wherein calculating the feature representation sequence S^op of the assembly instruction sequence and the feature representation sequence S^api of the API call sequence comprises:

after the assembly instruction sequence and the API call sequence are respectively numerically encoded, adding position codes to obtain the numerical coding sequence of the assembly instruction sequence and the numerical coding sequence of the API call sequence;

taking the output of the Transformer neural network model V' at the position corresponding to [CLS] as the feature representation sequence S^op or the feature representation sequence S^api of the malware;

inputting the formalized representation H^op and the formalized representation H^api into a neural network model based on cooperative attention to obtain a static collaborative feature representation and a dynamic collaborative feature representation of the malware; wherein inputting the formalized representation H^op and the formalized representation H^api into the neural network model based on cooperative attention to obtain the static collaborative feature representation and the dynamic collaborative feature representation of the malware comprises:

computing the static collaborative feature representation and the dynamic collaborative feature representation from the m-th vector representation of the formalized representation H^op, the m-th vector representation of the formalized representation H^api, and the m-th component of the attention weight α^op, wherein M denotes the length of the formalized representation H^op and of the formalized representation H^api;

2. The method of claim 1, wherein the obtaining the sequence of assembler instructions and the sequence of API calls for malware comprises:

3. The method of claim 1, wherein said numerically encoding said sequence of assembler instructions comprises:

4. The method of claim 1, wherein prepending a [CLS] field encoding to the feature representation sequence S^op, feeding it into the Transformer neural network model V based on a self-attention mechanism, and taking the output of the Transformer neural network model V at the position corresponding to [CLS] as the static feature representation v^op of the malware comprises:

obtaining the output O of the self-attention layer based on the formalized form of the feature calculation process of the Transformer neural network model V; wherein the formalized form of the feature calculation process is A = softmax(Q·K^T / √d_h) and O = A·V, where Q, K, and V respectively denote the query matrix, the key matrix, and the value matrix, A denotes the attention map containing the degree of association between each field and the [CLS] field, and d_h is the dimension of the feature space;

linearly transforming the output O, and taking the output of the Transformer neural network model V at the position corresponding to [CLS] as the static feature representation v^op of the malware.

5. The method of any one of claims 1-4, wherein the method further comprises:

recording classification influence factors; wherein the classification influence factors comprise: the importance coefficients of the assembly instructions and API calls for the construction of assembly function features and API call subsequence features, as calculated by the Transformer neural network model V based on a self-attention mechanism, and the importance coefficients of the assembly functions and API call subsequences for feature construction, as calculated by the neural network model based on collaborative attention;

and visualizing the classification influence factors to explain classification results of the malicious software.

6. A cooperative attention-based malware classification device, the device comprising:

a dynamic and static feature sequence normalization module, configured to calculate a feature representation sequence S^op of the assembly instruction sequence and a feature representation sequence S^api of the API call sequence; wherein calculating the feature representation sequence S^op of the assembly instruction sequence and the feature representation sequence S^api of the API call sequence comprises:

a malware feature representation and classification module, configured to: prepend a [CLS] field encoding to the feature representation sequence S^op and to the feature representation sequence S^api, and feed each into a Transformer neural network model V based on a self-attention mechanism; take the output of the Transformer neural network model V at the position corresponding to [CLS] as the static feature representation v^op and the dynamic feature representation v^api of the malware, and obtain the formalized representation H^op of the assembly instruction sequence and the formalized representation H^api of the API call sequence from the hidden-layer feature sequences of the Transformer neural network model V; input the formalized representation H^op and the formalized representation H^api into a neural network model based on cooperative attention to obtain a static collaborative feature representation and a dynamic collaborative feature representation of the malware; and concatenate the static feature representation v^op, the dynamic feature representation v^api, the static collaborative feature representation, and the dynamic collaborative feature representation and then classify them to obtain a classification result of the malware; wherein inputting the formalized representation H^op and the formalized representation H^api into the neural network model based on cooperative attention to obtain the static collaborative feature representation and the dynamic collaborative feature representation of the malware comprises:

computing the static collaborative feature representation and the dynamic collaborative feature representation from the m-th vector representation of the formalized representation H^op, the m-th vector representation of the formalized representation H^api, and the m-th component of the attention weight α^op, wherein M denotes the length of the formalized representation H^op and of the formalized representation H^api.

7. An electronic device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-5.

8. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1-5.