CN111027069A - Malicious software family detection method, storage medium and computing device - Google Patents

Malicious software family detection method, storage medium and computing device

Info

Publication number
CN111027069A
Authority
CN
China
Prior art keywords: malware, sample, layer, family, training
Legal status: Granted
Application number: CN201911202586.9A
Other languages: Chinese (zh)
Other versions: CN111027069B (en)
Inventor
孙玉霞 (Sun Yuxia)
宋涛 (Song Tao)
赵晶晶 (Zhao Jingjing)
Current Assignee: Jinan University
Original Assignee: Jinan University
Application filed by Jinan University
Priority to CN201911202586.9A
Publication of CN111027069A
Application granted
Publication of CN111027069B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods


Abstract

The invention discloses a malware family detection method, a storage medium, and a computing device. The method comprises: extracting features from all malware training samples of each class in a malware training set to obtain a plurality of corresponding feature vectors; converting the feature vectors into feature images, generating image pairs from the feature images, constructing a twin (Siamese) network model, and training the model with the image pairs; taking samples to be tested out of the malware test set and computing, with the trained twin network model, a similarity score between each sample to be tested and the malware training samples; and calculating a threshold and, according to the threshold, distinguishing the sample to be tested as belonging to a known malware family or to a new malware family. The method can correctly detect the category to which malware belongs and has a good classification effect.

Description

Malicious software family detection method, storage medium and computing device
Technical Field
The invention relates to the technical field of software security, in particular to a malicious software family detection method, a storage medium and a computing device.
Background
Malware is implanted into a victim's computer by hackers or attackers through security holes in the operating system or application software; it interferes with the user's normal operations and performs malicious actions such as collecting sensitive information and stealing superuser privileges. Common types of malware include exploits, backdoors, worms, Trojan horses, spyware, and rootkits, as well as combinations or variants of these types. Malware spreads rapidly through the many channels provided by the Internet, affecting the normal operation of the network. In recent years, the amount of malware has grown exponentially, making it difficult for malware analysts and antivirus software vendors to extract useful information from such large-scale data for analysis.
The emergence of new malware families brings new threats and deserves the attention of security researchers. Meanwhile, most existing research is devoted to classifying malware with similar behaviors or characteristics into known malware families; such classification methods cannot distinguish new malware families, because samples of new families do not take part in the training process. How to correctly and effectively detect a new malware family is therefore an important research problem.
Progress in deep learning has reshaped problem solving in fields such as natural language processing and computer vision and has removed the dependence on feature engineering, making many tasks easier and some even better than human performance. A general classification task must satisfy the condition that the classes of the test-set samples coincide with the classes of the training-set samples: by learning to distinguish samples of all known classes in the training set, the model acquires the ability to determine the class to which a test sample belongs. Deep neural networks, however, have a known weakness: when predicting a sample whose class is absent from the training set, they may output an "over-confident" value. Because the class probabilities output by the network sum to 1, the network still outputs per-class probabilities summing to 1 even when the input belongs to an unknown class, which makes it "too confident" about things it has never seen. Thus, if the sample to be tested belongs to an unknown class (a class the network was not trained on), the network cannot output a correct result, causing misclassification. The same problem exists in malware classification research. One might assume that samples of every existing malware family have been collected for training; but because of the naturally adversarial nature of the malware domain, malware authors continuously release new families. In a relatively open malware classification environment, a sample to be tested may therefore belong either to a known family in the training set or to a new family absent from it, and a traditional classifier will misclassify the latter. In view of this, it is necessary to develop a new malware family detection technique, that is, to detect samples to be tested that belong to none of the known families in the training set and to label them as a new malware family.
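As a purely illustrative aside (not part of the patented method), the following Python snippet demonstrates the behavior described above: softmax outputs always sum to 1, so even an input from an unseen class receives a confident-looking probability distribution over the known classes.

```python
import numpy as np

def softmax(z):
    """Standard softmax: exponentiate and normalize so the outputs sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

# Logits produced for a sample of an *unknown* class still yield a
# normalized, confident-looking distribution over the known classes:
p = softmax(np.array([2.1, 0.3, -1.0]))
print(p.round(2), p.sum())  # ~[0.83 0.14 0.04] 1.0
```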
In practice, many new malware families go unrecorded or even unnoticed. At the same time, it is important for security researchers to understand samples of a new malware family quickly. Once a piece of malware is detected as belonging to a new malicious family, researchers can examine the file with priority and manually analyze its behavior (e.g., network activity, system calls); only when they understand the malware's behavior can they remove it effectively. In short, detecting a new malware family can, to some extent, mitigate new threats to cyberspace security.
In summary, it is of great importance to research new malware family detection technology and to apply it in a relatively open malware detection environment. On the one hand, it avoids misclassifying a new malware family as a known one; on the other hand, it helps security researchers pay attention to new malware families in time.
Disclosure of Invention
The first object of the present invention is to overcome the drawbacks and deficiencies of the prior art and to provide a malware family detection method that can correctly detect the category to which malware belongs and has a good classification effect.
A second object of the present invention is to provide a storage medium.
It is a third object of the invention to provide a computing device.
The first purpose of the invention is realized by the following technical scheme: a malware family detection method comprises the following steps:
S1, feature extraction: extracting features from all the malware training samples of each class in the malware training set, respectively, to obtain a plurality of corresponding feature vectors;
S2, twin network design: converting the feature vectors into feature images, respectively, generating image pairs from the feature images, constructing a twin network model, and training the model with the image pairs;
S3, novelty measure: taking samples to be tested out of the malware test set, and computing the similarity score between each sample to be tested and the malware training samples by using the trained twin network model;
and calculating a threshold value, and distinguishing the sample to be tested as a known malware family or a new malware family according to the threshold value.
Preferably, in step S1, feature extraction is performed on the malware training samples to obtain corresponding feature vectors, and the process is as follows:
preprocessing the malware training samples: performing behavior analysis on each malware training sample to generate a corresponding report file, extracting all keywords in the report file, de-duplicating them, and saving them as a text file;
traversing all the text files in which keywords are stored, constructing a dictionary from the keywords in the text files, counting the number of occurrences of each keyword, and deleting from the dictionary the keywords whose occurrence count equals the number of samples (i.e., keywords that appear in every sample);
sorting the dictionary in descending order of keyword occurrence counts, and taking the N keywords with the highest counts as the new dictionary;
initializing an N-dimensional vector whose N dimensions correspond to the N different keywords, traversing again all the text files in which keywords are stored, and judging whether each keyword appears in the new dictionary:
if yes, setting the corresponding dimension of the vector to 1; if not, setting the corresponding dimension of the vector to 0;
the resulting N-dimensional binary vector is taken as the feature vector.
Furthermore, the sandbox is used for preprocessing the malware training samples, specifically, the malware training samples are submitted to the sandbox to be operated, and the sandbox generates a text file containing a behavior analysis report for each piece of malware.
Furthermore, the extracted keywords are unigrams, and the report file is a json report file.
Preferably, in step S2, the feature vectors are converted into feature images, an image pair is generated from the feature images, a twin network model is constructed, and the model is trained by using the image pair, as follows:
calculating the pixel value of each bit in the feature vector: mapping a bit value of 0 to a pixel value of 0 and a bit value of 1 to a pixel value of 255;
converting the N-dimensional feature vector into an X × Y pixel matrix, wherein N = X · Y, X is the number of rows of the pixel matrix, and Y is the number of columns of the pixel matrix;
converting the pixel matrix into a characteristic image;
pairing the characteristic images pairwise to form a large number of image pairs, wherein the image pairs comprise similar image pairs and dissimilar image pairs;
constructing a twin network model: selecting a sub-network type of the twin network, and determining parameter configuration of a twin network model;
taking the image pair as an input to train the twin network model, and outputting the similarity of the two characteristic vectors by the twin network model;
calculating the loss function L(x1, x2, y), which is computed as follows:
L(x1, x2, y) = −(y · log p(x1, x2) + (1 − y) · log(1 − p(x1, x2))) + λ‖w‖²
wherein x1 and x2 are the two feature images of an image pair; p(x1, x2) is the similarity output by the twin network model; y is the label; λ‖w‖² is the L2 weight decay term; λ is the weight decay coefficient; and w denotes the weights of the sub-network;
minimizing the loss so that the error between the output and the target output becomes smaller and smaller until the twin network model converges; the training ends when the twin network model reaches the set number of training rounds.
Furthermore, the sub-network is a convolutional neural network, and the twin network model comprises an input layer, 4 convolutional layers, 3 pooling layers, 3 full-connection layers and an output layer;
the input layer has 2 inputs; the 4 convolutional layers are a first, second, third, and fourth convolutional layer, the numbers of convolution kernels are 32, 64, and 128, the convolution kernels are of size 5 × 5, and the activation function is ReLU; the 3 pooling layers are a first, second, and third pooling layer, all using max pooling with a 2 × 2 window; the 3 fully-connected layers are a first, second, and third fully-connected layer with 4096, 2048, and 1 neurons, respectively;
the input layer, the first convolutional layer, the first pooling layer, the second convolutional layer, the second pooling layer, the third convolutional layer, the third pooling layer, the fourth convolutional layer, the 3 fully-connected layers, and the output layer are connected in sequence; the output of the fourth convolutional layer is fully connected to the 4096 ReLU neurons of the first fully-connected layer and then to the 2048 ReLU neurons of the second fully-connected layer, mapping the two input feature images into two 2048-dimensional feature vectors h1 and h2; the absolute difference between h1 and h2 is taken as the input of the third fully-connected layer, whose output is converted into a probability by a sigmoid function, i.e., normalized to [0, 1].
Preferably, in step S3, the samples to be tested are taken out of the malware test set, and the trained twin network model is used to compute the similarity score between each sample to be tested and the malware training samples, as follows:
step 1, taking out a sample to be tested from a malware test set;
step 2, aiming at each sample to be tested, calculating the similarity mean value of the sample to be tested and all the malware training samples in each class in the malware training set by using the trained twin network model;
step 3, taking the maximum value in the similarity mean value as the similarity score of the sample to be detected;
and step 4, repeating steps 1-3 until all samples to be tested in the malware test set have been processed, obtaining a similarity score for each sample to be tested.
Preferably, a threshold is calculated, and the sample to be tested is distinguished as a known malware family or a new malware family according to the threshold, specifically as follows:
taking a plurality of verification samples out of the malware verification set, and computing the similarity score between each verification sample and the malware training samples by using the trained twin network model;
generating a sequence of candidate scores from the lowest to the highest similarity score at a fixed step size, and using each candidate score in turn as a temporary threshold to compute the F1 score on the corresponding verification set;
selecting the temporary threshold with the highest F1 score as the final threshold;
distinguishing the class of the sample to be tested according to the threshold, and labeling the sample to be tested as a new malware family when its class does not belong to any known malware family in the training set;
the discrimination formula used is specifically as follows:
ND(X) = known family, if score > τ; new family, otherwise
wherein X is the sample to be tested, and ND is the new-malware-family detector; score is the similarity score; τ is a suitable threshold; known family denotes a known malware family; new family denotes a new malware family; and otherwise denotes score ≤ τ.
The second purpose of the invention is realized by the following technical scheme: a storage medium stores a program that, when executed by a processor, implements the malware family detection method according to the first object of the present invention.
The third purpose of the invention is realized by the following technical scheme: a computing device comprising a processor and a memory for storing processor-executable programs, the processor, when executing the programs stored in the memory, implementing the malware family detection method of the first object of the present invention.
Compared with the prior art, the invention has the following advantages and effects:
(1) In the malware family detection method of the invention, features are first extracted from all malware training samples of each class in the malware training set to obtain a plurality of corresponding feature vectors; the feature vectors are converted into feature images, image pairs are generated from the feature images, and a twin network model is constructed and trained with the image pairs; samples to be tested are taken out of the malware test set, and the trained twin network model is used to compute the similarity score between each sample to be tested and the malware training samples; finally, a threshold is calculated, and the sample to be tested is distinguished as a known malware family or a new malware family according to the threshold. The detection method detects the category of malware through three steps, namely feature extraction, twin network design, and novelty measure; it has high detection accuracy and a good classification effect. Malware that belongs to no known malware family in the training set is detected and labeled as a new malware family, which can to some extent mitigate new threats to cyberspace security.
(2) The malware family detection method combines the twin network with feature images, achieving higher precision, recall, F1 score, and accuracy, and lower false-alarm and miss rates.
(3) In the malware family detection method, the extracted features are not hand-crafted; they are extracted automatically from the run-time characteristics of the malware. No samples distributed differently from the training samples are added during training, and no extra models need to be trained in order to handle new families; the process is simple and has high popularization value.
Drawings
FIG. 1 is a flow chart of the malware family detection method of the present invention.
Fig. 2 is a flow chart of a feature vector generation process.
Fig. 3 is a flowchart of the feature image generation process.
Fig. 4 is a structural diagram of a twin network model.
FIG. 5 is a flow chart of a novelty measure.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Example 1
The embodiment discloses a malware family detection method, as shown in fig. 1, including the following steps:
s1, feature extraction: respectively performing feature extraction on all malware training samples of each class in the malware training set to obtain a plurality of corresponding feature vectors, as shown in fig. 2, the process is as follows:
S11, preprocessing the malware training samples: performing behavior analysis on each malware training sample to generate a corresponding report file, extracting all keywords from the report file, de-duplicating them, and saving them as a text file. Each piece of malware thus corresponds to one text file.
In this embodiment, a sandbox is used to preprocess the malware training samples: the samples are submitted to the sandbox to run, and the sandbox generates a text file containing a behavior analysis report for each piece of malware. The sandbox may be a Cuckoo sandbox, a special system environment that records the behavior of the programs running inside it, such as API function calls, passed parameters, created or deleted files, and accessed websites and ports.
In this embodiment, the extracted keywords are unigrams, and the report file is a json report file. All unigrams are extracted and de-duplicated; for example, given the fragment "api": "DeleteFileW", in the report, the extracted unigrams are the tokens "api": and "DeleteFileW", with their punctuation attached. The json report file is then saved as a txt text file.
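By way of illustration only, the following is a minimal Python sketch of this preprocessing step. The patent publishes no code; treating the JSON report as plain text and splitting on whitespace are assumptions, and the file paths are hypothetical.

```python
from pathlib import Path

def report_to_keywords(report_path: str, out_path: str) -> None:
    """Extract whitespace-separated unigrams from a Cuckoo JSON report,
    de-duplicate them, and save them to a txt file, one keyword per line."""
    raw = Path(report_path).read_text(encoding="utf-8", errors="ignore")
    unigrams = set(raw.split())  # tokenization rule assumed: split on whitespace
    Path(out_path).write_text("\n".join(sorted(unigrams)), encoding="utf-8")

# Hypothetical usage: one report per malware sample.
# report_to_keywords("reports/sample1.json", "keywords/sample1.txt")
```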
S12, traversing all the text files in which keywords are stored, constructing a dictionary from the keywords in the text files, counting the number of occurrences of each keyword, and deleting from the dictionary the keywords whose occurrence count equals the number of samples, i.e., deleting keywords common to all samples; for example, this embodiment deletes unigrams that carry no useful information, such as json field names.
S13, sorting the dictionary in descending order of keyword occurrence counts, and taking the N keywords with the highest counts as the new dictionary. In this embodiment N = 20000, so the dictionary stores the top-20000 keywords over all malware samples.
S14, initializing an N-dimensional vector whose N dimensions correspond to the N different keywords, traversing again all the text files in which keywords are stored, and judging whether each keyword appears in the new dictionary:
if yes, setting the corresponding dimension of the vector to 1; if not, setting the corresponding dimension of the vector to 0;
the resulting N-dimensional binary vector is taken as the feature vector.
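A minimal sketch of steps S12-S14 is given below, assuming the txt files produced above store one keyword per line; all names are illustrative, not from the patent.

```python
from collections import Counter
from pathlib import Path

def build_dictionary(txt_dir: str, n: int = 20000) -> list:
    """Count, for each keyword, the number of samples it occurs in; delete
    keywords that occur in every sample; keep the N most frequent ones."""
    files = list(Path(txt_dir).glob("*.txt"))
    counts = Counter()
    for fp in files:
        counts.update(set(fp.read_text(encoding="utf-8").splitlines()))
    for kw in [k for k, c in counts.items() if c == len(files)]:
        del counts[kw]  # keywords common to all samples carry no information
    return [kw for kw, _ in counts.most_common(n)]

def to_feature_vector(txt_path: str, dictionary: list) -> list:
    """N-dimensional binary vector: dimension i is 1 iff dictionary[i] occurs."""
    kws = set(Path(txt_path).read_text(encoding="utf-8").splitlines())
    return [1 if kw in kws else 0 for kw in dictionary]
```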
S2, twin network design: respectively converting the plurality of feature vectors into feature images, generating an image pair according to the feature images, constructing a twin network model and training the model by using the image pair, wherein the process comprises the following steps:
s21, as shown in fig. 3, calculating the pixel value of each bit in the feature vector: a bit value of 0 is mapped to a pixel value of 0 and a bit value of 1 is mapped to a pixel value of 255.
The N-dimensional feature vector is converted into an X × Y pixel matrix, where N = X · Y, X is the number of rows of the pixel matrix, and Y is the number of columns. This embodiment converts the 20000-dimensional feature vector into a 200 × 100 pixel matrix.
The pixel matrix is converted into a feature image.
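A sketch of this vector-to-image conversion under the 200 × 100 layout of this embodiment; the use of Pillow and row-major reshaping are assumptions.

```python
import numpy as np
from PIL import Image

def vector_to_image(vec, rows=200, cols=100):
    """Map bit 0 -> pixel 0 and bit 1 -> pixel 255, then reshape the
    20000-dimensional vector into a 200 x 100 grayscale feature image."""
    arr = (np.asarray(vec, dtype=np.uint8) * 255).reshape(rows, cols)
    return Image.fromarray(arr, mode="L")
```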
And S22, pairing the characteristic images in pairs to form a plurality of image pairs, wherein the image pairs comprise similar image pairs and dissimilar image pairs. Similar image pairs may also be referred to as positive sample pairs and dissimilar image pairs may also be referred to as negative sample pairs.
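A sketch of pair generation follows; exhaustive pairing is shown for clarity, although in practice positive and negative pairs would typically be sampled in a balanced way (an assumption, not stated in the patent).

```python
import itertools
import random

def make_pairs(images, labels):
    """Pair feature images two by two: label 1 for a similar (same-family)
    pair, label 0 for a dissimilar (different-family) pair."""
    pairs = [(images[i], images[j], int(labels[i] == labels[j]))
             for i, j in itertools.combinations(range(len(images)), 2)]
    random.shuffle(pairs)
    return pairs
```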
S23, constructing a twin network model: selecting the sub-network type of the twin network, and determining the parameter configuration of the twin network model.
In this embodiment, the twin network model has a structure as shown in fig. 4, the sub-network is a convolutional neural network CNN, and the twin network model includes an input layer, 4 convolutional layers, 3 pooling layers, 3 fully-connected layers, and an output layer.
The parameter configuration is shown in Table 1. The input layer has 2 inputs, x1 and x2. The 4 convolutional layers are a first, second, third, and fourth convolutional layer; the numbers of convolution kernels are 32, 64, and 128; the convolution kernels are of size 5 × 5; and the activation function is ReLU. The 3 pooling layers are a first, second, and third pooling layer, all using max pooling with a 2 × 2 window. The first fully-connected layer has 4096 neurons, the second fully-connected layer has 2048 neurons, and the third fully-connected layer has 1 neuron.
The input layer, the first convolutional layer, the first pooling layer, the second convolutional layer, the second pooling layer, the third convolutional layer, the third pooling layer, the fourth convolutional layer, the 3 fully-connected layers, and the output layer are connected in sequence; the output of the fourth convolutional layer is fully connected to the 4096 ReLU neurons of the first fully-connected layer and then to the 2048 ReLU neurons of the second fully-connected layer, mapping the two input feature images into two 2048-dimensional feature vectors h1 and h2; the absolute difference between h1 and h2 is taken as the input of the third fully-connected layer, whose output is converted into a probability by a sigmoid function, i.e., normalized to [0, 1].
Table 1: parameter configuration of the twin network model (published as images in the original document; the configuration is as described in the preceding paragraphs).
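For illustration, a PyTorch sketch of a twin network matching the description above is given below. The patent publishes no code; the kernel count of the fourth convolutional layer, the unpadded ("valid") convolutions, and the 1 × 200 × 100 input shape are assumptions.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """Shared sub-network: 4 convolutional layers, 3 max-pooling layers,
    and the first two fully-connected layers (4096 and 2048 neurons)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 128, 5), nn.ReLU(),  # 4th conv; kernel count assumed
        )
        self.fc = nn.Sequential(
            nn.LazyLinear(4096), nn.ReLU(),    # first fully-connected layer
            nn.Linear(4096, 2048), nn.ReLU(),  # second fully-connected layer
        )

    def forward(self, x):  # x: (batch, 1, 200, 100)
        return self.fc(self.features(x).flatten(1))

class TwinNet(nn.Module):
    """Twin network: both inputs pass through the same branch; the absolute
    difference |h1 - h2| feeds the third fully-connected layer + sigmoid."""
    def __init__(self):
        super().__init__()
        self.branch = Branch()          # weights shared between the two inputs
        self.head = nn.Linear(2048, 1)  # third fully-connected layer, 1 neuron

    def forward(self, x1, x2):
        h1, h2 = self.branch(x1), self.branch(x2)
        return torch.sigmoid(self.head(torch.abs(h1 - h2)))  # similarity in [0, 1]
```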
And S24, training the twin network model by taking the image pair as input, and outputting the similarity of the two feature vectors by the twin network model.
S25, calculating the loss function L(x1, x2, y); the loss function is the binary cross entropy between the prediction and the target, computed as follows:
L(x1, x2, y) = −(y · log p(x1, x2) + (1 − y) · log(1 − p(x1, x2))) + λ‖w‖²
wherein x1 and x2 are the two feature images of an image pair; p(x1, x2) is the similarity output by the twin network model; y is the label; λ‖w‖² is the L2 weight decay term; λ is the weight decay coefficient; and w denotes the weights of the sub-network.
The loss is minimized so that the error between the output and the target output becomes smaller and smaller until the twin network model converges; the training ends when the twin network model reaches the set number of training rounds, which in this embodiment is 20.
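A training-loop sketch follows; the optimizer and learning rate are not specified in the patent, so Adam with a weight_decay argument is assumed here as a stand-in for the λ‖w‖² term.

```python
import torch
import torch.nn.functional as F

def train_twin(model, pair_loader, epochs=20, lr=1e-4, lam=1e-4):
    """Minimize the binary cross-entropy between the predicted similarity and
    the pair label; weight_decay approximates the L2 weight decay term."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=lam)
    for _ in range(epochs):  # 20 training rounds in this embodiment
        for x1, x2, y in pair_loader:
            p = model(x1, x2).squeeze(1)  # predicted similarity, shape (batch,)
            loss = F.binary_cross_entropy(p, y.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
```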
S3, novelty measure: as shown in fig. 5, samples to be tested are taken out of the malware test set, and the trained twin network model is used to compute the similarity score between each sample to be tested and the malware training samples;
a threshold is then calculated, and the sample to be tested is distinguished as a known malware family or a new malware family according to the threshold.
Wherein, the calculation process of the similarity score is as follows:
step 1, taking out a sample to be tested from a malware test set;
step 2, aiming at each sample to be tested, calculating the similarity mean value of the sample to be tested and all the malware training samples in each class in the malware training set by using the trained twin network model;
step 3, taking the maximum value in the similarity mean value as the similarity score of the sample to be detected;
and step 4, repeating steps 1-3 until all samples to be tested in the malware test set have been processed, obtaining a similarity score for each sample to be tested.
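A sketch of steps 1-3 for one sample to be tested is given below; the dictionary-of-tensors layout of the training images is an assumption.

```python
import torch

@torch.no_grad()
def similarity_score(model, test_img, train_imgs_by_family):
    """Mean similarity between the sample to be tested and all the training
    samples of each class; the score is the maximum of the per-class means."""
    means = []
    for family, imgs in train_imgs_by_family.items():
        sims = [model(test_img, img).item() for img in imgs]  # pairwise similarity
        means.append(sum(sims) / len(sims))
    return max(means)  # similarity score of the sample to be tested
```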
Calculating a threshold value, and distinguishing whether the sample to be detected is a known malware family or a new malware family according to the threshold value, wherein the method specifically comprises the following steps:
(1) A plurality of verification samples are taken out of the malware verification set, and the trained twin network model is used to compute the similarity score between each verification sample and the malware training samples;
(2) candidate scores are generated from the lowest to the highest similarity score at a fixed step size, and each candidate score is used in turn as a temporary threshold to compute the F1 score on the corresponding verification set; the F1 score can be calculated with the usual F1 formula. The fixed step size in this embodiment is 0.1.
(3) And selecting the temporary threshold with the highest F1 score as the final threshold.
(4) The classes of the samples to be tested are distinguished according to the threshold, and a sample to be tested is labeled as a new malware family when its class does not belong to any known malware family in the training set.
The discrimination formula used is specifically as follows:
ND(X) = known family, if score > τ; new family, otherwise
wherein X is the sample to be tested, and ND (novelty detector) is the new-malware-family detector; score is the similarity score; τ is a suitable threshold; known family denotes a known malware family; new family denotes a new malware family; and otherwise denotes score ≤ τ.
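A sketch of the threshold search and the discrimination rule ND(X) follows; scikit-learn's f1_score is used as "the usual F1 calculation formula", and the label convention (1 = known family) is an assumption.

```python
import numpy as np
from sklearn.metrics import f1_score

def select_threshold(val_scores, val_is_known, step=0.1):
    """Sweep temporary thresholds from the lowest to the highest validation
    similarity score at a fixed step (0.1 here); keep the best-F1 threshold."""
    best_tau, best_f1 = None, -1.0
    for tau in np.arange(min(val_scores), max(val_scores) + step, step):
        pred_known = [s > tau for s in val_scores]  # score > tau -> known family
        f1 = f1_score(val_is_known, pred_known)
        if f1 > best_f1:
            best_tau, best_f1 = float(tau), f1
    return best_tau

def nd(score, tau):
    """ND(X): known family if score > tau, otherwise a new malware family."""
    return "known family" if score > tau else "new family"
```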
Example 2
The embodiment discloses a storage medium, which stores a program, and when the program is executed by a processor, the method for detecting a malware family according to embodiment 1 is implemented, specifically as follows:
S1, feature extraction: extracting features from all the malware training samples of each class in the malware training set, respectively, to obtain a plurality of corresponding feature vectors;
S2, twin network design: converting the feature vectors into feature images, respectively, generating image pairs from the feature images, constructing a twin network model, and training the model with the image pairs;
S3, novelty measure: taking samples to be tested out of the malware test set, and computing the similarity score between each sample to be tested and the malware training samples by using the trained twin network model;
and calculating a threshold, and distinguishing the sample to be tested as a known malware family or a new malware family according to the threshold.
The storage medium in this embodiment may be a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), a USB flash drive, a removable hard disk, or other media.
Example 3
The embodiment discloses a computing device, which includes a processor and a memory for storing an executable program of the processor, and when the processor executes the program stored in the memory, the method for detecting a malware family according to embodiment 1 is implemented, specifically as follows:
S1, feature extraction: extracting features from all the malware training samples of each class in the malware training set, respectively, to obtain a plurality of corresponding feature vectors;
S2, twin network design: converting the feature vectors into feature images, respectively, generating image pairs from the feature images, constructing a twin network model, and training the model with the image pairs;
S3, novelty measure: taking samples to be tested out of the malware test set, and computing the similarity score between each sample to be tested and the malware training samples by using the trained twin network model;
and calculating a threshold, and distinguishing the sample to be tested as a known malware family or a new malware family according to the threshold.
The computing device described in this embodiment may be a desktop computer, a notebook computer, a smartphone, a PDA handheld terminal, a tablet computer, or another terminal device with a processor.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A malware family detection method, comprising the steps of:
S1, feature extraction: extracting features from all the malware training samples of each class in the malware training set, respectively, to obtain a plurality of corresponding feature vectors;
S2, twin network design: converting the feature vectors into feature images, respectively, generating image pairs from the feature images, constructing a twin network model, and training the model with the image pairs;
S3, novelty measure: taking samples to be tested out of the malware test set, and computing the similarity score between each sample to be tested and the malware training samples by using the trained twin network model;
and calculating a threshold, and distinguishing the sample to be tested as a known malware family or a new malware family according to the threshold.
2. The malware family detection method of claim 1, wherein in step S1, feature extraction is performed on the malware training samples to obtain corresponding feature vectors, and the process is as follows:
preprocessing the malware training samples: performing behavior analysis on each malware training sample to generate a corresponding report file, extracting all keywords in the report file, de-duplicating them, and saving them as a text file;
traversing all the text files in which keywords are stored, constructing a dictionary from the keywords in the text files, counting the number of occurrences of each keyword, and deleting from the dictionary the keywords whose occurrence count equals the number of samples;
sorting the dictionary in descending order of keyword occurrence counts, and taking the N keywords with the highest counts as the new dictionary;
initializing an N-dimensional vector whose N dimensions correspond to the N different keywords, traversing again all the text files in which keywords are stored, and judging whether each keyword appears in the new dictionary:
if yes, setting the corresponding dimension of the vector to 1; if not, setting the corresponding dimension of the vector to 0;
the resulting N-dimensional binary vector is taken as the feature vector.
3. The malware family detection method of claim 2, wherein the malware training samples are preprocessed with a sandbox; specifically, the malware training samples are submitted to the sandbox to run, and the sandbox generates a text file containing a behavior analysis report for each piece of malware.
4. The malware family detection method of claim 2, wherein the extracted keywords are unigrams and the report file is a json report file.
5. The malware family detecting method of claim 1, wherein in step S2, the feature vectors are respectively converted into feature images, an image pair is generated according to the feature images, a twin network model is constructed, and the model is trained by using the image pair, and the procedures are as follows:
calculating the pixel value of each bit in the feature vector: mapping a bit value of 0 to a pixel value of 0 and a bit value of 1 to a pixel value of 255;
converting the N-dimensional feature vector into an X × Y pixel matrix, wherein N = X · Y, X is the number of rows of the pixel matrix, and Y is the number of columns of the pixel matrix;
converting the pixel matrix into a characteristic image;
pairing the characteristic images pairwise to form a large number of image pairs, wherein the image pairs comprise similar image pairs and dissimilar image pairs;
constructing a twin network model: selecting a sub-network type of the twin network, and determining parameter configuration of a twin network model;
taking the image pair as an input to train the twin network model, and outputting the similarity of the two characteristic vectors by the twin network model;
calculating the loss function L(x1, x2, y), which is computed as follows:
L(x1, x2, y) = −(y · log p(x1, x2) + (1 − y) · log(1 − p(x1, x2))) + λ‖w‖²
wherein x1 and x2 are the two feature images of an image pair; p(x1, x2) is the similarity output by the twin network model; y is the label; λ‖w‖² is the L2 weight decay term; λ is the weight decay coefficient; and w denotes the weights of the sub-network;
minimizing the loss so that the error between the output and the target output becomes smaller and smaller until the twin network model converges; the training ends when the twin network model reaches the set number of training rounds.
6. The malware family detection method of claim 5, wherein the sub-network is a convolutional neural network, the twin network model comprises an input layer, 4 convolutional layers, 3 pooling layers, 3 fully-connected layers, and an output layer;
the input layer has 2 inputs; the 4 convolutional layers are a first, second, third, and fourth convolutional layer, the numbers of convolution kernels are 32, 64, and 128, the convolution kernels are of size 5 × 5, and the activation function is ReLU; the 3 pooling layers are a first, second, and third pooling layer, all using max pooling with a 2 × 2 window; the 3 fully-connected layers are a first, second, and third fully-connected layer with 4096, 2048, and 1 neurons, respectively;
the input layer, the first convolutional layer, the first pooling layer, the second convolutional layer, the second pooling layer, the third convolutional layer, the third pooling layer, the fourth convolutional layer, the 3 fully-connected layers, and the output layer are connected in sequence; the output of the fourth convolutional layer is fully connected to the 4096 ReLU neurons of the first fully-connected layer and then to the 2048 ReLU neurons of the second fully-connected layer, mapping the two input feature images into two 2048-dimensional feature vectors h1 and h2; the absolute difference between h1 and h2 is taken as the input of the third fully-connected layer, whose output is converted into a probability by a sigmoid function, i.e., normalized to [0, 1].
7. The malware family detection method of claim 1, wherein in step S3, the samples to be tested are taken from the malware test set, and the similarity score between each sample to be tested and the malware training sample is calculated by using the trained twin network model, and the process is as follows:
step 1, taking out a sample to be tested from a malware test set;
step 2, aiming at each sample to be tested, calculating the similarity mean value of the sample to be tested and all the malware training samples in each class in the malware training set by using the trained twin network model;
step 3, taking the maximum value in the similarity mean value as the similarity score of the sample to be detected;
and step 4, repeating steps 1-3 until all samples to be tested in the malware test set have been processed, obtaining a similarity score for each sample to be tested.
8. The malware family detection method of claim 1, wherein a threshold is calculated, and the sample to be tested is distinguished as a known malware family or a new malware family according to the threshold, specifically as follows:
taking a plurality of verification samples out of the malware verification set, and computing the similarity score between each verification sample and the malware training samples by using the trained twin network model;
generating a sequence of candidate scores from the lowest to the highest similarity score at a fixed step size, and using each candidate score in turn as a temporary threshold to compute the F1 score on the corresponding verification set;
selecting the temporary threshold with the highest F1 score as the final threshold;
distinguishing the class of the sample to be tested according to the threshold, and labeling the sample to be tested as a new malware family when its class does not belong to any known malware family in the training set;
the discrimination formula used is specifically as follows:
ND(X) = known family, if score > τ; new family, otherwise
wherein X is the sample to be tested, and ND is the new-malware-family detector; score is the similarity score; τ is a suitable threshold; known family denotes a known malware family; new family denotes a new malware family; and otherwise denotes score ≤ τ.
9. A storage medium storing a program, wherein the program, when executed by a processor, implements the malware family detection method of any one of claims 1 to 8.
10. A computing device comprising a processor and a memory for storing processor-executable programs, wherein the processor, when executing a program stored in the memory, implements the malware family detection method of any one of claims 1 to 8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911202586.9A CN111027069B (en) 2019-11-29 2019-11-29 Malicious software family detection method, storage medium and computing device


Publications (2)

Publication Number Publication Date
CN111027069A (en) 2020-04-17
CN111027069B (en) 2022-04-08

Family

ID=70203636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911202586.9A Active CN111027069B (en) 2019-11-29 2019-11-29 Malicious software family detection method, storage medium and computing device

Country Status (1)

Country Link
CN (1) CN111027069B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10491627B1 (en) * 2016-09-29 2019-11-26 Fireeye, Inc. Advanced malware detection using similarity analysis
CN108256325A (en) * 2016-12-29 2018-07-06 中移(苏州)软件技术有限公司 A kind of method and apparatus of the detection of malicious code mutation
CN106803039A (en) * 2016-12-30 2017-06-06 北京神州绿盟信息安全科技股份有限公司 The homologous decision method and device of a kind of malicious file
CN109670304A (en) * 2017-10-13 2019-04-23 北京安天网络安全技术有限公司 Recognition methods, device and the electronic equipment of malicious code family attribute
CN109145605A (en) * 2018-08-23 2019-01-04 北京理工大学 A kind of Android malware family clustering method based on SinglePass algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Cordonsky, I., et al.: "DeepOrigin: End-to-End Deep Learning for Detection of New Malware Families", 2018 International Joint Conference on Neural Networks (IJCNN) *
Shen, Yan, et al.: "Classifier based on an improved deep Siamese network and its application" (基于改进深度孪生网络的分类器及其应用), Computer Engineering and Applications (《计算机工程与应用》) *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783093A (en) * 2020-06-28 2020-10-16 南京航空航天大学 Malicious software classification and detection method based on soft dependence
CN112001424B (en) * 2020-07-29 2023-05-23 暨南大学 Malicious software open set family classification method and device based on countermeasure training
CN112001424A (en) * 2020-07-29 2020-11-27 暨南大学 Malicious software open set family classification method and device based on countermeasure training
CN112000954A (en) * 2020-08-25 2020-11-27 莫毓昌 Malicious software detection method based on feature sequence mining and simplification
CN112000954B (en) * 2020-08-25 2024-01-30 华侨大学 Malicious software detection method based on feature sequence mining and simplification
WO2021151343A1 (en) * 2020-09-09 2021-08-05 平安科技(深圳)有限公司 Test sample category determination method and apparatus for siamese network, and terminal device
CN111984780A (en) * 2020-09-11 2020-11-24 深圳市北科瑞声科技股份有限公司 Multi-intention recognition model training method, multi-intention recognition method and related device
CN112347479A (en) * 2020-10-21 2021-02-09 北京天融信网络安全技术有限公司 False alarm correction method, device, equipment and storage medium for malicious software detection
CN112347479B (en) * 2020-10-21 2021-08-24 北京天融信网络安全技术有限公司 False alarm correction method, device, equipment and storage medium for malicious software detection
CN112329786A (en) * 2020-12-02 2021-02-05 深圳大学 Method, device and equipment for detecting copied image and storage medium
CN112329786B (en) * 2020-12-02 2023-06-16 深圳大学 Method, device, equipment and storage medium for detecting flip image
CN112764791B (en) * 2021-01-25 2023-08-08 济南大学 Incremental update malicious software detection method and system
CN112764791A (en) * 2021-01-25 2021-05-07 济南大学 Incremental updating malicious software detection method and system
WO2022171067A1 (en) * 2021-02-09 2022-08-18 北京有竹居网络技术有限公司 Video processing method and apparatus, and storage medium and device
CN113392399A (en) * 2021-06-23 2021-09-14 绿盟科技集团股份有限公司 Malicious software classification method, device, equipment and medium
CN114139153A (en) * 2021-11-02 2022-03-04 武汉大学 Graph representation learning-based malware interpretability classification method
CN114611102A (en) * 2022-02-23 2022-06-10 西安电子科技大学 Visual malicious software detection and classification method and system, storage medium and terminal

Also Published As

Publication number Publication date
CN111027069B (en) 2022-04-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant