CN111027069A - Malicious software family detection method, storage medium and computing device - Google Patents

Malicious software family detection method, storage medium and computing device

Info

Publication number
CN111027069A
Authority
CN
China
Prior art keywords: malware, sample, layer, family, training
Legal status: Granted
Application number: CN201911202586.9A
Other languages: Chinese (zh)
Other versions: CN111027069B (en)
Inventor
孙玉霞 (Sun Yuxia)
宋涛 (Song Tao)
赵晶晶 (Zhao Jingjing)
Current Assignee: Jinan University
Original Assignee: Jinan University
Application filed by Jinan University
Priority to CN201911202586.9A
Publication of CN111027069A
Application granted
Publication of CN111027069B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods


Abstract

The invention discloses a malware family detection method, a storage medium, and a computing device. The method comprises: extracting features from all malware training samples of each class in a malware training set to obtain a plurality of corresponding feature vectors; converting the feature vectors into feature images, generating image pairs from the feature images, constructing a twin (Siamese) network model, and training the model with the image pairs; taking samples to be tested out of the malware test set and computing, with the trained twin network model, a similarity score between each sample to be tested and the malware training samples; and calculating a threshold and, according to the threshold, distinguishing the sample to be tested as belonging to a known malware family or to a new malware family. The method can correctly detect the category to which malware belongs and has a good classification effect.

Description

Malicious software family detection method, storage medium and computing device
Technical Field
The invention relates to the technical field of software security, in particular to a malicious software family detection method, a storage medium and a computing device.
Background
Malware is implanted into a victim's computer by hackers or attackers through security holes in the operating system or application software; it interferes with the user's normal operations and performs malicious actions such as collecting sensitive information and stealing superuser privileges. Common types of malware include exploits, backdoors, worms, Trojan horses, spyware, and rootkits, as well as combinations or variants of these types. Malware spreads rapidly through the many channels provided by the Internet, affecting the normal operation of the network. In recent years, the amount of malware has grown exponentially, making it difficult for malware analysts and antivirus software vendors to extract useful information from such large-scale data for analysis.
The emergence of new malware families brings new threats and deserves the attention of security researchers. Meanwhile, most existing research is devoted to classifying malware with similar behaviors or characteristics into known malware families; such classification methods cannot distinguish new malware families, because samples of new families do not take part in the training process. How to correctly and effectively detect a new malware family is therefore an important research problem.
Progress in deep learning has reshaped problem solving in fields such as natural language processing and computer vision and has removed the dependence on feature engineering, making many tasks easier and some even better than human performance. A general classification task must satisfy the condition that the classes of the test-set samples coincide with the classes of the training-set samples: by learning to distinguish samples of all known classes in the training set, the model acquires the ability to determine the class to which a test sample belongs. Deep neural networks, however, have a known weakness: when predicting a sample whose class is absent from the training set, they may output an "over-confident" value. Because the class probabilities output by the network sum to 1, the network still outputs per-class probabilities summing to 1 even when the input belongs to an unknown class, which makes it "too confident" about things it has never seen. Thus, if the sample to be tested belongs to an unknown class (a class the network was not trained on), the network cannot output a correct result, causing misclassification. The same problem exists in malware classification research. One might assume that samples of every existing malware family have been collected for training; but because of the naturally adversarial nature of the malware domain, malware authors continuously release new families. In a relatively open malware classification environment, a sample to be tested may therefore belong either to a known family in the training set or to a new family absent from it, and a traditional classifier will misclassify the latter. In view of this, it is necessary to develop a new malware family detection technique, that is, to detect samples to be tested that belong to none of the known families in the training set and to label them as a new malware family.
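As a purely illustrative aside (not part of the patented method), the following Python snippet demonstrates the behavior described above: softmax outputs always sum to 1, so even an input from an unseen class receives a confident-looking probability distribution over the known classes.

```python
import numpy as np

def softmax(z):
    """Standard softmax: exponentiate and normalize so the outputs sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

# Logits produced for a sample of an *unknown* class still yield a
# normalized, confident-looking distribution over the known classes:
p = softmax(np.array([2.1, 0.3, -1.0]))
print(p.round(2), p.sum())  # ~[0.83 0.14 0.04] 1.0
```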
In practice, many new malware families go unrecorded or even unnoticed. At the same time, it is important for security researchers to understand samples of a new malware family quickly. Once a piece of malware is detected as belonging to a new malicious family, researchers can examine the file with priority and manually analyze its behavior (e.g., network activity, system calls); only when they understand the malware's behavior can they remove it effectively. In short, detecting a new malware family can, to some extent, mitigate new threats to cyberspace security.
In summary, it is of great importance to research new malware family detection technology and to apply it in a relatively open malware detection environment. On the one hand, it avoids misclassifying a new malware family as a known one; on the other hand, it helps security researchers pay attention to new malware families in time.
Disclosure of Invention
The first object of the present invention is to overcome the drawbacks and deficiencies of the prior art and to provide a malware family detection method that can correctly detect the category to which malware belongs and has a good classification effect.
A second object of the present invention is to provide a storage medium.
It is a third object of the invention to provide a computing device.
The first purpose of the invention is realized by the following technical scheme: a malware family detection method comprises the following steps:
S1, feature extraction: extracting features from all the malware training samples of each class in the malware training set, respectively, to obtain a plurality of corresponding feature vectors;
S2, twin network design: converting the feature vectors into feature images, respectively, generating image pairs from the feature images, constructing a twin network model, and training the model with the image pairs;
S3, novelty measure: taking samples to be tested out of the malware test set, and computing the similarity score between each sample to be tested and the malware training samples by using the trained twin network model;
and calculating a threshold value, and distinguishing the sample to be tested as a known malware family or a new malware family according to the threshold value.
Preferably, in step S1, feature extraction is performed on the malware training samples to obtain corresponding feature vectors, and the process is as follows:
preprocessing the malware training samples: performing behavior analysis on each malware training sample to generate a corresponding report file, extracting all keywords in the report file, de-duplicating them, and saving them as a text file;
traversing all the text files in which keywords are stored, constructing a dictionary from the keywords in the text files, counting the number of occurrences of each keyword, and deleting from the dictionary the keywords whose occurrence count equals the number of samples (i.e., keywords that appear in every sample);
sorting the dictionary in descending order of keyword occurrence counts, and taking the N keywords with the highest counts as the new dictionary;
initializing an N-dimensional vector whose N dimensions correspond to the N different keywords, traversing again all the text files in which keywords are stored, and judging whether each keyword appears in the new dictionary:
if yes, setting the corresponding dimension of the vector to 1; if not, setting the corresponding dimension of the vector to 0;
the resulting N-dimensional binary vector is taken as the feature vector.
Furthermore, the sandbox is used for preprocessing the malware training samples, specifically, the malware training samples are submitted to the sandbox to be operated, and the sandbox generates a text file containing a behavior analysis report for each piece of malware.
Furthermore, the extracted keywords are unigrams, and the report file is a json report file.
Preferably, in step S2, the feature vectors are converted into feature images, an image pair is generated from the feature images, a twin network model is constructed, and the model is trained by using the image pair, as follows:
calculating the pixel value of each bit in the feature vector: mapping a bit value of 0 to a pixel value of 0 and a bit value of 1 to a pixel value of 255;
converting the N-dimensional feature vector into an X × Y pixel matrix, wherein N = X · Y, X is the number of rows of the pixel matrix, and Y is the number of columns of the pixel matrix;
converting the pixel matrix into a characteristic image;
pairing the characteristic images pairwise to form a large number of image pairs, wherein the image pairs comprise similar image pairs and dissimilar image pairs;
constructing a twin network model: selecting a sub-network type of the twin network, and determining parameter configuration of a twin network model;
taking the image pair as an input to train the twin network model, and outputting the similarity of the two characteristic vectors by the twin network model;
calculating the loss function L(x1, x2, y), which is computed as follows:
L(x1, x2, y) = −(y · log p(x1, x2) + (1 − y) · log(1 − p(x1, x2))) + λ‖w‖²
wherein x1 and x2 are the two feature images of an image pair; p(x1, x2) is the similarity output by the twin network model; y is the label; λ‖w‖² is the L2 weight decay term; λ is the weight decay coefficient; and w denotes the weights of the sub-network;
minimizing the loss so that the error between the output and the target output becomes smaller and smaller until the twin network model converges; the training ends when the twin network model reaches the set number of training rounds.
Furthermore, the sub-network is a convolutional neural network, and the twin network model comprises an input layer, 4 convolutional layers, 3 pooling layers, 3 full-connection layers and an output layer;
the input layer has 2 inputs; the 4 convolutional layers are a first, second, third, and fourth convolutional layer, the numbers of convolution kernels are 32, 64, and 128, the convolution kernels are of size 5 × 5, and the activation function is ReLU; the 3 pooling layers are a first, second, and third pooling layer, all using max pooling with a 2 × 2 window; the 3 fully-connected layers are a first, second, and third fully-connected layer with 4096, 2048, and 1 neurons, respectively;
the input layer, the first convolutional layer, the first pooling layer, the second convolutional layer, the second pooling layer, the third convolutional layer, the third pooling layer, the fourth convolutional layer, the 3 fully-connected layers, and the output layer are connected in sequence; the output of the fourth convolutional layer is fully connected to the 4096 ReLU neurons of the first fully-connected layer and then to the 2048 ReLU neurons of the second fully-connected layer, mapping the two input feature images into two 2048-dimensional feature vectors h1 and h2; the absolute difference between h1 and h2 is taken as the input of the third fully-connected layer, whose output is converted into a probability by a sigmoid function, i.e., normalized to [0, 1].
Preferably, in step S3, the samples to be tested are taken out of the malware test set, and the trained twin network model is used to compute the similarity score between each sample to be tested and the malware training samples, as follows:
step 1, taking out a sample to be tested from a malware test set;
step 2, aiming at each sample to be tested, calculating the similarity mean value of the sample to be tested and all the malware training samples in each class in the malware training set by using the trained twin network model;
step 3, taking the maximum value in the similarity mean value as the similarity score of the sample to be detected;
and step 4, repeating steps 1-3 until all samples to be tested in the malware test set have been processed, obtaining a similarity score for each sample to be tested.
Preferably, a threshold is calculated, and the sample to be tested is distinguished as a known malware family or a new malware family according to the threshold, specifically as follows:
taking a plurality of verification samples out of the malware verification set, and computing the similarity score between each verification sample and the malware training samples by using the trained twin network model;
generating a sequence of candidate scores from the lowest to the highest similarity score at a fixed step size, and using each candidate score in turn as a temporary threshold to compute the F1 score on the corresponding verification set;
selecting the temporary threshold with the highest F1 score as the final threshold;
distinguishing the class of the sample to be tested according to the threshold, and labeling the sample to be tested as a new malware family when its class does not belong to any known malware family in the training set;
the discrimination formula used is specifically as follows:
ND(X) = known family, if score > τ; new family, otherwise
wherein X is the sample to be tested, and ND is the new-malware-family detector; score is the similarity score; τ is a suitable threshold; known family denotes a known malware family; new family denotes a new malware family; and otherwise denotes score ≤ τ.
The second purpose of the invention is realized by the following technical scheme: a storage medium stores a program that, when executed by a processor, implements the malware family detection method according to the first object of the present invention.
The third purpose of the invention is realized by the following technical scheme: a computing device comprising a processor and a memory for storing processor-executable programs, the processor, when executing the programs stored in the memory, implementing the malware family detection method of the first object of the present invention.
Compared with the prior art, the invention has the following advantages and effects:
(1) In the malware family detection method of the invention, features are first extracted from all malware training samples of each class in the malware training set to obtain a plurality of corresponding feature vectors; the feature vectors are converted into feature images, image pairs are generated from the feature images, and a twin network model is constructed and trained with the image pairs; samples to be tested are taken out of the malware test set, and the trained twin network model is used to compute the similarity score between each sample to be tested and the malware training samples; finally, a threshold is calculated, and the sample to be tested is distinguished as a known malware family or a new malware family according to the threshold. The detection method detects the category of malware through three steps, namely feature extraction, twin network design, and novelty measure; it has high detection accuracy and a good classification effect. Malware that belongs to no known malware family in the training set is detected and labeled as a new malware family, which can to some extent mitigate new threats to cyberspace security.
(2) The malware family detection method combines the twin network with feature images, achieving higher precision, recall, F1 score, and accuracy, and lower false-alarm and miss rates.
(3) In the malware family detection method, the extracted features are not hand-crafted; they are extracted automatically from the run-time characteristics of the malware. No samples distributed differently from the training samples are added during training, and no extra models need to be trained in order to handle new families; the process is simple and has high popularization value.
Drawings
FIG. 1 is a flow chart of the malware family detection method of the present invention.
Fig. 2 is a flow chart of a feature vector generation process.
Fig. 3 is a flowchart of the feature image generation process.
Fig. 4 is a structural diagram of a twin network model.
FIG. 5 is a flow chart of a novelty measure.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Example 1
The embodiment discloses a malware family detection method, as shown in fig. 1, including the following steps:
s1, feature extraction: respectively performing feature extraction on all malware training samples of each class in the malware training set to obtain a plurality of corresponding feature vectors, as shown in fig. 2, the process is as follows:
S11, preprocessing the malware training samples: performing behavior analysis on each malware training sample to generate a corresponding report file, extracting all keywords from the report file, de-duplicating them, and saving them as a text file. Each piece of malware thus corresponds to one text file.
In this embodiment, a sandbox is used to preprocess the malware training samples: the samples are submitted to the sandbox to run, and the sandbox generates a text file containing a behavior analysis report for each piece of malware. The sandbox may be a Cuckoo sandbox, a special system environment that records the behavior of the programs running inside it, such as API function calls, passed parameters, created or deleted files, and accessed websites and ports.
In this embodiment, the extracted keywords are unigrams, and the report file is a json report file. All unigrams are extracted and de-duplicated; for example, given the fragment "api": "DeleteFileW", in the report, the extracted unigrams are the tokens "api": and "DeleteFileW", with their punctuation attached. The json report file is then saved as a txt text file.
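By way of illustration only, the following is a minimal Python sketch of this preprocessing step. The patent publishes no code; treating the JSON report as plain text and splitting on whitespace are assumptions, and the file paths are hypothetical.

```python
from pathlib import Path

def report_to_keywords(report_path: str, out_path: str) -> None:
    """Extract whitespace-separated unigrams from a Cuckoo JSON report,
    de-duplicate them, and save them to a txt file, one keyword per line."""
    raw = Path(report_path).read_text(encoding="utf-8", errors="ignore")
    unigrams = set(raw.split())  # tokenization rule assumed: split on whitespace
    Path(out_path).write_text("\n".join(sorted(unigrams)), encoding="utf-8")

# Hypothetical usage: one report per malware sample.
# report_to_keywords("reports/sample1.json", "keywords/sample1.txt")
```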
S12, traversing all the text files in which keywords are stored, constructing a dictionary from the keywords in the text files, counting the number of occurrences of each keyword, and deleting from the dictionary the keywords whose occurrence count equals the number of samples, i.e., deleting keywords common to all samples; for example, this embodiment deletes unigrams that carry no useful information, such as json field names.
S13, sorting the dictionary in descending order of keyword occurrence counts, and taking the N keywords with the highest counts as the new dictionary. In this embodiment N = 20000, so the dictionary stores the top-20000 keywords over all malware samples.
S14, initializing an N-dimensional vector whose N dimensions correspond to the N different keywords, traversing again all the text files in which keywords are stored, and judging whether each keyword appears in the new dictionary:
if yes, setting the corresponding dimension of the vector to 1; if not, setting the corresponding dimension of the vector to 0;
the resulting N-dimensional binary vector is taken as the feature vector.
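A minimal sketch of steps S12-S14 is given below, assuming the txt files produced above store one keyword per line; all names are illustrative, not from the patent.

```python
from collections import Counter
from pathlib import Path

def build_dictionary(txt_dir: str, n: int = 20000) -> list:
    """Count, for each keyword, the number of samples it occurs in; delete
    keywords that occur in every sample; keep the N most frequent ones."""
    files = list(Path(txt_dir).glob("*.txt"))
    counts = Counter()
    for fp in files:
        counts.update(set(fp.read_text(encoding="utf-8").splitlines()))
    for kw in [k for k, c in counts.items() if c == len(files)]:
        del counts[kw]  # keywords common to all samples carry no information
    return [kw for kw, _ in counts.most_common(n)]

def to_feature_vector(txt_path: str, dictionary: list) -> list:
    """N-dimensional binary vector: dimension i is 1 iff dictionary[i] occurs."""
    kws = set(Path(txt_path).read_text(encoding="utf-8").splitlines())
    return [1 if kw in kws else 0 for kw in dictionary]
```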
S2, twin network design: respectively converting the plurality of feature vectors into feature images, generating an image pair according to the feature images, constructing a twin network model and training the model by using the image pair, wherein the process comprises the following steps:
s21, as shown in fig. 3, calculating the pixel value of each bit in the feature vector: a bit value of 0 is mapped to a pixel value of 0 and a bit value of 1 is mapped to a pixel value of 255.
The N-dimensional feature vector is converted into an X × Y pixel matrix, where N = X · Y, X is the number of rows of the pixel matrix, and Y is the number of columns. This embodiment converts the 20000-dimensional feature vector into a 200 × 100 pixel matrix.
The pixel matrix is converted into a feature image.
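A sketch of this vector-to-image conversion under the 200 × 100 layout of this embodiment; the use of Pillow and row-major reshaping are assumptions.

```python
import numpy as np
from PIL import Image

def vector_to_image(vec, rows=200, cols=100):
    """Map bit 0 -> pixel 0 and bit 1 -> pixel 255, then reshape the
    20000-dimensional vector into a 200 x 100 grayscale feature image."""
    arr = (np.asarray(vec, dtype=np.uint8) * 255).reshape(rows, cols)
    return Image.fromarray(arr, mode="L")
```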
And S22, pairing the characteristic images in pairs to form a plurality of image pairs, wherein the image pairs comprise similar image pairs and dissimilar image pairs. Similar image pairs may also be referred to as positive sample pairs and dissimilar image pairs may also be referred to as negative sample pairs.
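A sketch of pair generation follows; exhaustive pairing is shown for clarity, although in practice positive and negative pairs would typically be sampled in a balanced way (an assumption, not stated in the patent).

```python
import itertools
import random

def make_pairs(images, labels):
    """Pair feature images two by two: label 1 for a similar (same-family)
    pair, label 0 for a dissimilar (different-family) pair."""
    pairs = [(images[i], images[j], int(labels[i] == labels[j]))
             for i, j in itertools.combinations(range(len(images)), 2)]
    random.shuffle(pairs)
    return pairs
```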
S23, constructing a twin network model: selecting the sub-network type of the twin network, and determining the parameter configuration of the twin network model.
In this embodiment, the twin network model has a structure as shown in fig. 4, the sub-network is a convolutional neural network CNN, and the twin network model includes an input layer, 4 convolutional layers, 3 pooling layers, 3 fully-connected layers, and an output layer.
The parameter configuration is shown in Table 1. The input layer has 2 inputs, x1 and x2. The 4 convolutional layers are a first, second, third, and fourth convolutional layer; the numbers of convolution kernels are 32, 64, and 128; the convolution kernels are of size 5 × 5; and the activation function is ReLU. The 3 pooling layers are a first, second, and third pooling layer, all using max pooling with a 2 × 2 window. The first fully-connected layer has 4096 neurons, the second fully-connected layer has 2048 neurons, and the third fully-connected layer has 1 neuron.
The input layer, the first convolutional layer, the first pooling layer, the second convolutional layer, the second pooling layer, the third convolutional layer, the third pooling layer, the fourth convolutional layer, the 3 fully-connected layers, and the output layer are connected in sequence; the output of the fourth convolutional layer is fully connected to the 4096 ReLU neurons of the first fully-connected layer and then to the 2048 ReLU neurons of the second fully-connected layer, mapping the two input feature images into two 2048-dimensional feature vectors h1 and h2; the absolute difference between h1 and h2 is taken as the input of the third fully-connected layer, whose output is converted into a probability by a sigmoid function, i.e., normalized to [0, 1].
Table 1: parameter configuration of the twin network model (published as images in the original document; the configuration is as described in the preceding paragraphs).
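For illustration, a PyTorch sketch of a twin network matching the description above is given below. The patent publishes no code; the kernel count of the fourth convolutional layer, the unpadded ("valid") convolutions, and the 1 × 200 × 100 input shape are assumptions.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """Shared sub-network: 4 convolutional layers, 3 max-pooling layers,
    and the first two fully-connected layers (4096 and 2048 neurons)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 5), nn.ReLU(),  # 4th conv; kernel count assumed
        )
        self.fc = nn.Sequential(
            nn.LazyLinear(4096), nn.ReLU(),    # first fully-connected layer
            nn.Linear(4096, 2048), nn.ReLU(),  # second fully-connected layer
        )

    def forward(self, x):  # x: (batch, 1, 200, 100)
        return self.fc(self.features(x).flatten(1))

class TwinNet(nn.Module):
    """Twin network: both inputs pass through the same branch; the absolute
    difference |h1 - h2| feeds the third fully-connected layer + sigmoid."""
    def __init__(self):
        super().__init__()
        self.branch = Branch()          # weights shared between the two inputs
        self.head = nn.Linear(2048, 1)  # third fully-connected layer, 1 neuron

    def forward(self, x1, x2):
        h1, h2 = self.branch(x1), self.branch(x2)
        return torch.sigmoid(self.head(torch.abs(h1 - h2)))  # similarity in [0, 1]
```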
And S24, training the twin network model by taking the image pair as input, and outputting the similarity of the two feature vectors by the twin network model.
S25, calculating the loss function L(x1, x2, y); the loss function is the binary cross entropy between the prediction and the target, computed as follows:
L(x1, x2, y) = −(y · log p(x1, x2) + (1 − y) · log(1 − p(x1, x2))) + λ‖w‖²
wherein x1 and x2 are the two feature images of an image pair; p(x1, x2) is the similarity output by the twin network model; y is the label; λ‖w‖² is the L2 weight decay term; λ is the weight decay coefficient; and w denotes the weights of the sub-network.
The loss is minimized so that the error between the output and the target output becomes smaller and smaller until the twin network model converges; the training ends when the twin network model reaches the set number of training rounds, which in this embodiment is 20.
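A training-loop sketch follows; the optimizer and learning rate are not specified in the patent, so Adam with a weight_decay argument is assumed here as a stand-in for the λ‖w‖² term.

```python
import torch
import torch.nn.functional as F

def train_twin(model, pair_loader, epochs=20, lr=1e-4, lam=1e-4):
    """Minimize the binary cross-entropy between the predicted similarity and
    the pair label; weight_decay approximates the L2 weight decay term."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=lam)
    for _ in range(epochs):  # 20 training rounds in this embodiment
        for x1, x2, y in pair_loader:
            p = model(x1, x2).squeeze(1)  # predicted similarity, shape (batch,)
            loss = F.binary_cross_entropy(p, y.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
```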
S3, novelty measure: as shown in fig. 5, samples to be tested are taken out of the malware test set, and the trained twin network model is used to compute the similarity score between each sample to be tested and the malware training samples;
a threshold is then calculated, and the sample to be tested is distinguished as a known malware family or a new malware family according to the threshold.
Wherein, the calculation process of the similarity score is as follows:
step 1, taking out a sample to be tested from a malware test set;
step 2, aiming at each sample to be tested, calculating the similarity mean value of the sample to be tested and all the malware training samples in each class in the malware training set by using the trained twin network model;
step 3, taking the maximum value in the similarity mean value as the similarity score of the sample to be detected;
and step 4, repeating steps 1-3 until all samples to be tested in the malware test set have been processed, obtaining a similarity score for each sample to be tested.
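A sketch of steps 1-3 for one sample to be tested is given below; the dictionary-of-tensors layout of the training images is an assumption.

```python
import torch

@torch.no_grad()
def similarity_score(model, test_img, train_imgs_by_family):
    """Mean similarity between the sample to be tested and all the training
    samples of each class; the score is the maximum of the per-class means."""
    means = []
    for family, imgs in train_imgs_by_family.items():
        sims = [model(test_img, img).item() for img in imgs]  # pairwise similarity
        means.append(sum(sims) / len(sims))
    return max(means)  # similarity score of the sample to be tested
```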
Calculating a threshold value, and distinguishing whether the sample to be detected is a known malware family or a new malware family according to the threshold value, wherein the method specifically comprises the following steps:
(1) A plurality of verification samples are taken out of the malware verification set, and the trained twin network model is used to compute the similarity score between each verification sample and the malware training samples;
(2) candidate scores are generated from the lowest to the highest similarity score at a fixed step size, and each candidate score is used in turn as a temporary threshold to compute the F1 score on the corresponding verification set; the F1 score can be calculated with the usual F1 formula. The fixed step size in this embodiment is 0.1.
(3) And selecting the temporary threshold with the highest F1 score as the final threshold.
(4) The classes of the samples to be tested are distinguished according to the threshold, and a sample to be tested is labeled as a new malware family when its class does not belong to any known malware family in the training set.
The discrimination formula used is specifically as follows:
ND(X) = known family, if score > τ; new family, otherwise
wherein X is the sample to be tested, and ND (novelty detector) is the new-malware-family detector; score is the similarity score; τ is a suitable threshold; known family denotes a known malware family; new family denotes a new malware family; and otherwise denotes score ≤ τ.
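A sketch of the threshold search and the discrimination rule ND(X) follows; scikit-learn's f1_score is used as "the usual F1 calculation formula", and the label convention (1 = known family) is an assumption.

```python
import numpy as np
from sklearn.metrics import f1_score

def select_threshold(val_scores, val_is_known, step=0.1):
    """Sweep temporary thresholds from the lowest to the highest validation
    similarity score at a fixed step (0.1 here); keep the best-F1 threshold."""
    best_tau, best_f1 = None, -1.0
    for tau in np.arange(min(val_scores), max(val_scores) + step, step):
        pred_known = [s > tau for s in val_scores]  # score > tau -> known family
        f1 = f1_score(val_is_known, pred_known)
        if f1 > best_f1:
            best_tau, best_f1 = float(tau), f1
    return best_tau

def nd(score, tau):
    """ND(X): known family if score > tau, otherwise a new malware family."""
    return "known family" if score > tau else "new family"
```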
Example 2
The embodiment discloses a storage medium, which stores a program, and when the program is executed by a processor, the method for detecting a malware family according to embodiment 1 is implemented, specifically as follows:
S1, feature extraction: extracting features from all the malware training samples of each class in the malware training set, respectively, to obtain a plurality of corresponding feature vectors;
S2, twin network design: converting the feature vectors into feature images, respectively, generating image pairs from the feature images, constructing a twin network model, and training the model with the image pairs;
S3, novelty measure: taking samples to be tested out of the malware test set, and computing the similarity score between each sample to be tested and the malware training samples by using the trained twin network model;
and calculating a threshold, and distinguishing the sample to be tested as a known malware family or a new malware family according to the threshold.
The storage medium in this embodiment may be a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), a USB flash drive, a removable hard disk, or other media.
Example 3
The embodiment discloses a computing device, which includes a processor and a memory for storing an executable program of the processor, and when the processor executes the program stored in the memory, the method for detecting a malware family according to embodiment 1 is implemented, specifically as follows:
S1, feature extraction: extracting features from all the malware training samples of each class in the malware training set, respectively, to obtain a plurality of corresponding feature vectors;
S2, twin network design: converting the feature vectors into feature images, respectively, generating image pairs from the feature images, constructing a twin network model, and training the model with the image pairs;
S3, novelty measure: taking samples to be tested out of the malware test set, and computing the similarity score between each sample to be tested and the malware training samples by using the trained twin network model;
and calculating a threshold, and distinguishing the sample to be tested as a known malware family or a new malware family according to the threshold.
The computing device described in this embodiment may be a desktop computer, a notebook computer, a smartphone, a PDA handheld terminal, a tablet computer, or another terminal device with a processor.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A malware family detection method, comprising the steps of:
S1, feature extraction: extracting features from all the malware training samples of each class in the malware training set, respectively, to obtain a plurality of corresponding feature vectors;
S2, twin network design: converting the feature vectors into feature images, respectively, generating image pairs from the feature images, constructing a twin network model, and training the model with the image pairs;
S3, novelty measure: taking samples to be tested out of the malware test set, and computing the similarity score between each sample to be tested and the malware training samples by using the trained twin network model;
and calculating a threshold, and distinguishing the sample to be tested as a known malware family or a new malware family according to the threshold.
2. The malware family detection method of claim 1, wherein in step S1, feature extraction is performed on the malware training samples to obtain corresponding feature vectors, and the process is as follows:
preprocessing the malware training samples: performing behavior analysis on each malware training sample to generate a corresponding report file, extracting all keywords in the report file, de-duplicating them, and saving them as a text file;
traversing all the text files in which keywords are stored, constructing a dictionary from the keywords in the text files, counting the number of occurrences of each keyword, and deleting from the dictionary the keywords whose occurrence count equals the number of samples;
sorting the dictionary in descending order of keyword occurrence counts, and taking the N keywords with the highest counts as the new dictionary;
initializing an N-dimensional vector whose N dimensions correspond to the N different keywords, traversing again all the text files in which keywords are stored, and judging whether each keyword appears in the new dictionary:
if yes, setting the corresponding dimension of the vector to 1; if not, setting the corresponding dimension of the vector to 0;
the resulting N-dimensional binary vector is taken as the feature vector.
3. The malware family detection method of claim 2, wherein the malware training samples are preprocessed with a sandbox; specifically, the malware training samples are submitted to the sandbox to run, and the sandbox generates a text file containing a behavior analysis report for each piece of malware.
4. The malware family detection method of claim 2, wherein the extracted keywords are unigrams and the report file is a json report file.
5. The malware family detecting method of claim 1, wherein in step S2, the feature vectors are respectively converted into feature images, an image pair is generated according to the feature images, a twin network model is constructed, and the model is trained by using the image pair, and the procedures are as follows:
calculating the pixel value of each bit in the feature vector: mapping a bit value of 0 to a pixel value of 0 and a bit value of 1 to a pixel value of 255;
converting the N-dimensional feature vector into an X × Y pixel matrix, wherein N = X · Y, X is the number of rows of the pixel matrix, and Y is the number of columns of the pixel matrix;
converting the pixel matrix into a characteristic image;
pairing the characteristic images pairwise to form a large number of image pairs, wherein the image pairs comprise similar image pairs and dissimilar image pairs;
constructing a twin network model: selecting a sub-network type of the twin network, and determining parameter configuration of a twin network model;
taking the image pair as an input to train the twin network model, and outputting the similarity of the two characteristic vectors by the twin network model;
calculating the loss function L(x1, x2, y), which is computed as follows:
L(x1, x2, y) = −(y · log p(x1, x2) + (1 − y) · log(1 − p(x1, x2))) + λ‖w‖²
wherein x1 and x2 are the two feature images of an image pair; p(x1, x2) is the similarity output by the twin network model; y is the label; λ‖w‖² is the L2 weight decay term; λ is the weight decay coefficient; and w denotes the weights of the sub-network;
minimizing the loss so that the error between the output and the target output becomes smaller and smaller until the twin network model converges; the training ends when the twin network model reaches the set number of training rounds.
6. The malware family detection method of claim 5, wherein the sub-network is a convolutional neural network, the twin network model comprises an input layer, 4 convolutional layers, 3 pooling layers, 3 fully-connected layers, and an output layer;
the input layer has 2 inputs; the 4 convolutional layers are a first, second, third, and fourth convolutional layer, the numbers of convolution kernels are 32, 64, and 128, the convolution kernels are of size 5 × 5, and the activation function is ReLU; the 3 pooling layers are a first, second, and third pooling layer, all using max pooling with a 2 × 2 window; the 3 fully-connected layers are a first, second, and third fully-connected layer with 4096, 2048, and 1 neurons, respectively;
the input layer, the first convolutional layer, the first pooling layer, the second convolutional layer, the second pooling layer, the third convolutional layer, the third pooling layer, the fourth convolutional layer, the 3 fully-connected layers, and the output layer are connected in sequence; the output of the fourth convolutional layer is fully connected to the 4096 ReLU neurons of the first fully-connected layer and then to the 2048 ReLU neurons of the second fully-connected layer, mapping the two input feature images into two 2048-dimensional feature vectors h1 and h2; the absolute difference between h1 and h2 is taken as the input of the third fully-connected layer, whose output is converted into a probability by a sigmoid function, i.e., normalized to [0, 1].
7. The malware family detection method of claim 1, wherein in step S3, the samples to be tested are taken from the malware test set, and the similarity score between each sample to be tested and the malware training sample is calculated by using the trained twin network model, and the process is as follows:
step 1, taking out a sample to be tested from a malware test set;
step 2, aiming at each sample to be tested, calculating the similarity mean value of the sample to be tested and all the malware training samples in each class in the malware training set by using the trained twin network model;
step 3, taking the maximum value in the similarity mean value as the similarity score of the sample to be detected;
and step 4, repeating steps 1-3 until all samples to be tested in the malware test set have been processed, obtaining a similarity score for each sample to be tested.
8. The malware family detection method of claim 1, wherein a threshold is calculated, and the sample to be tested is distinguished as a known malware family or a new malware family according to the threshold, specifically as follows:
taking a plurality of verification samples out of the malware verification set, and computing the similarity score between each verification sample and the malware training samples by using the trained twin network model;
generating a sequence of candidate scores from the lowest to the highest similarity score at a fixed step size, and using each candidate score in turn as a temporary threshold to compute the F1 score on the corresponding verification set;
selecting the temporary threshold with the highest F1 score as the final threshold;
distinguishing the class of the sample to be tested according to the threshold, and labeling the sample to be tested as a new malware family when its class does not belong to any known malware family in the training set;
the discrimination formula used is specifically as follows:
ND(X) = known family, if score > τ; new family, otherwise
wherein X is the sample to be tested, and ND is the new-malware-family detector; score is the similarity score; τ is a suitable threshold; known family denotes a known malware family; new family denotes a new malware family; and otherwise denotes score ≤ τ.
9. A storage medium storing a program, wherein the program, when executed by a processor, implements the malware family detection method of any one of claims 1 to 8.
10. A computing device comprising a processor and a memory for storing processor-executable programs, wherein the processor, when executing a program stored in the memory, implements the malware family detection method of any one of claims 1 to 8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911202586.9A CN111027069B (en) 2019-11-29 2019-11-29 Malicious software family detection method, storage medium and computing device


Publications (2)

Publication Number Publication Date
CN111027069A (en) 2020-04-17
CN111027069B (en) 2022-04-08

Family

ID=70203636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911202586.9A Active CN111027069B (en) 2019-11-29 2019-11-29 Malicious software family detection method, storage medium and computing device

Country Status (1)

Country Link
CN (1) CN111027069B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10491627B1 (en) * 2016-09-29 2019-11-26 Fireeye, Inc. Advanced malware detection using similarity analysis
CN108256325A (en) * 2016-12-29 2018-07-06 中移(苏州)软件技术有限公司 A kind of method and apparatus of the detection of malicious code mutation
CN106803039A (en) * 2016-12-30 2017-06-06 北京神州绿盟信息安全科技股份有限公司 The homologous decision method and device of a kind of malicious file
CN109670304A (en) * 2017-10-13 2019-04-23 北京安天网络安全技术有限公司 Recognition methods, device and the electronic equipment of malicious code family attribute
CN109145605A (en) * 2018-08-23 2019-01-04 北京理工大学 A kind of Android malware family clustering method based on SinglePass algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Cordonsky, I., et al.: "DeepOrigin: End-to-End Deep Learning for Detection of New Malware Families", 2018 International Joint Conference on Neural Networks (IJCNN) *
Shen, Yan, et al.: "Classifier based on an improved deep Siamese network and its application" (基于改进深度孪生网络的分类器及其应用), Computer Engineering and Applications (《计算机工程与应用》) *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783093A (en) * 2020-06-28 2020-10-16 南京航空航天大学 Malicious software classification and detection method based on soft dependence
CN112001424B (en) * 2020-07-29 2023-05-23 暨南大学 Malicious software open set family classification method and device based on countermeasure training
CN112001424A (en) * 2020-07-29 2020-11-27 暨南大学 Malicious software open set family classification method and device based on countermeasure training
CN112000954A (en) * 2020-08-25 2020-11-27 莫毓昌 Malicious software detection method based on feature sequence mining and simplification
CN112000954B (en) * 2020-08-25 2024-01-30 华侨大学 Malicious software detection method based on feature sequence mining and simplification
WO2021151343A1 (en) * 2020-09-09 2021-08-05 平安科技(深圳)有限公司 Test sample category determination method and apparatus for siamese network, and terminal device
CN111984780A (en) * 2020-09-11 2020-11-24 深圳市北科瑞声科技股份有限公司 Multi-intention recognition model training method, multi-intention recognition method and related device
CN112347479A (en) * 2020-10-21 2021-02-09 北京天融信网络安全技术有限公司 False alarm correction method, device, equipment and storage medium for malicious software detection
CN112347479B (en) * 2020-10-21 2021-08-24 北京天融信网络安全技术有限公司 False alarm correction method, device, equipment and storage medium for malicious software detection
CN112329786A (en) * 2020-12-02 2021-02-05 深圳大学 Method, device and equipment for detecting copied image and storage medium
CN112329786B (en) * 2020-12-02 2023-06-16 深圳大学 Method, device, equipment and storage medium for detecting flip image
CN112764791B (en) * 2021-01-25 2023-08-08 济南大学 Incremental update malicious software detection method and system
CN112764791A (en) * 2021-01-25 2021-05-07 济南大学 Incremental updating malicious software detection method and system
WO2022171067A1 (en) * 2021-02-09 2022-08-18 北京有竹居网络技术有限公司 Video processing method and apparatus, and storage medium and device
CN113392399A (en) * 2021-06-23 2021-09-14 绿盟科技集团股份有限公司 Malicious software classification method, device, equipment and medium
CN114139153A (en) * 2021-11-02 2022-03-04 武汉大学 Graph representation learning-based malware interpretability classification method
CN114611102A (en) * 2022-02-23 2022-06-10 西安电子科技大学 Visual malicious software detection and classification method and system, storage medium and terminal

Also Published As

Publication number Publication date
CN111027069B (en) 2022-04-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant