CN111027069A - Malicious software family detection method, storage medium and computing device - Google Patents
- Publication number: CN111027069A
- Application number: CN201911202586.9A
- Authority
- CN
- China
- Prior art keywords
- malware
- sample
- layer
- family
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/566—Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Security & Cryptography (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Computer Hardware Design (AREA)
- Biomedical Technology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Virology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a malware family detection method, a storage medium and a computing device. The method extracts features from all malware training samples of each class in a malware training set to obtain the corresponding feature vectors; converts the feature vectors into feature images, generates image pairs from the feature images, and constructs and trains a twin network model with the image pairs; takes samples under test from the malware test set and uses the trained twin network model to compute a similarity score between each sample under test and the malware training samples; and computes a threshold, according to which each sample under test is classified as belonging to a known malware family or a new malware family. The method and device correctly detect the category to which malware belongs and have a good classification effect.
Description
Technical Field
The invention relates to the technical field of software security, in particular to a malicious software family detection method, a storage medium and a computing device.
Background
Malware is implanted into a victim's computer by a hacker or attacker through security vulnerabilities in the operating system or application software, where it disrupts the user's normal operations and performs malicious actions such as collecting sensitive information and stealing superuser privileges. Mainstream malware types include exploits, backdoors, worms, Trojan horses, spyware and rootkits, as well as combinations or variants of these types. Malware spreads rapidly over the many channels provided by the Internet, affecting the normal functioning of the network. In recent years, the volume of malware has grown exponentially, making it difficult for malware analysts and antivirus vendors to extract useful information from such large-scale data.
The emergence of new malware families brings new threats and deserves the attention of security researchers. Most existing research, however, is devoted to classifying malware with similar behaviors or characteristics into known malware families; such classifiers cannot recognize a new family, because no samples of that family took part in the training process. How to detect new malware families correctly and effectively is therefore an important research problem.
Progress in deep learning has reshaped problem solving in fields such as natural language processing and computer vision and has removed the dependence on feature engineering; many tasks have become easier, and on some tasks models now surpass humans. An ordinary classification task must satisfy one condition: the classes of the test-set samples coincide with the classes of the training-set samples. By learning to distinguish samples of all known classes in the training set, the model gains the ability to decide which of those classes a test sample belongs to. Deep neural networks, however, have a known weakness: when presented with a sample of a class absent from the training set, they may output an "overconfident" value. The probabilities a neural network outputs over the known classes sum to 1, so a sample of an unknown class still receives class probabilities summing to 1, and the network is "overconfident" about something it has never seen. If the sample under test belongs to an unknown class (one the network was not trained on), the network cannot output the correct result, and misclassification follows. The same problem exists in malware classification research. Suppose samples of every existing malware family are collected for training; because of the inherently adversarial nature of the malware field, malware authors continually release new families, so in a relatively open classification environment a sample under test may belong either to a known family in the training set or to a new family that does not appear there, and a traditional classification approach will misclassify the latter.
In view of the above problems, it is necessary to develop a new malware family detection technique, that is, to detect a sample to be detected that does not belong to all known families in the training set, and label the detected sample as a new malware family.
In practical situations, many new families of malware are unrecorded or even noticed. At the same time, it is important for security researchers to quickly understand samples of a new malware family. Once it is detected that malware belongs to a new malicious family, they can look preferentially at this file, manually analyze its behavior (e.g., network activity, system calls, etc.), and remove it better only if they are aware of the malware-related behavior. In short, detecting a new malware family can mitigate new threats to cyberspace security to some extent.
In summary, it is of great importance to research a new family of malware detection technology and apply the technology to a relatively open malware detection environment. On one hand, the method can avoid the misclassification of the new malware family into the known malware family, and on the other hand, can help security researchers to pay attention to the new malware family in time.
Disclosure of Invention
The first purpose of the present invention is to overcome the drawbacks and deficiencies of the prior art by providing a malware family detection method that correctly detects the category to which malware belongs and has a good classification effect.
A second object of the present invention is to provide a storage medium.
It is a third object of the invention to provide a computing device.
The first purpose of the invention is achieved by the following technical scheme: a malware family detection method comprising the following steps:
S1, feature extraction: extract features from all the malware training samples of each class in the malware training set to obtain the corresponding feature vectors;
S2, twin network design: convert the feature vectors into feature images, generate image pairs from the feature images, and construct and train a twin network model with the image pairs;
S3, novelty measure: take samples under test from the malware test set and use the trained twin network model to compute the similarity score between each sample under test and the malware training samples;
then compute a threshold and, according to it, classify each sample under test as belonging to a known malware family or a new malware family.
Preferably, in step S1, feature extraction is performed on the malware training samples to obtain the corresponding feature vectors, as follows:
preprocess the malware training samples: perform behavior analysis on each sample to generate a corresponding report file, extract all keywords from the report file, remove duplicates, and save the result as a text file;
traverse all the text files in which keywords are stored, build a dictionary from the keywords they contain, count the number of occurrences of each keyword, and delete from the dictionary the keywords whose occurrence count equals the number of samples;
sort the dictionary in descending order of keyword occurrence count and take the N most frequent keywords as the new dictionary;
initialize an N-dimensional vector whose N dimensions correspond to N different keywords, traverse all the keyword text files again, and check whether each keyword appears in the new dictionary:
if so, set the corresponding dimension of the vector to 1; if not, set it to 0;
the resulting N-dimensional binary vector is taken as the feature vector.
Furthermore, a sandbox is used to preprocess the malware training samples: each sample is submitted to the sandbox and run, and the sandbox generates a text file containing a behavior analysis report for each piece of malware.
Furthermore, the extracted keywords are unigrams, and the report file is a JSON report file.
Preferably, in step S2, the feature vectors are converted into feature images, image pairs are generated from the feature images, and a twin network model is constructed and trained with the image pairs, as follows:
compute the pixel value of each bit in the feature vector: a bit value of 0 maps to pixel value 0 and a bit value of 1 maps to pixel value 255;
convert the N-dimensional feature vector into an X×Y pixel matrix, where N = X·Y, X is the number of rows of the pixel matrix and Y is the number of columns;
convert the pixel matrix into a feature image;
pair the feature images two by two to form a large number of image pairs, comprising similar image pairs and dissimilar image pairs;
construct the twin network model: select the sub-network type of the twin network and determine the parameter configuration of the model;
train the twin network model with the image pairs as input; the model outputs the similarity of the two inputs;
compute the loss function L(x1, x2, y) as:
L(x1, x2, y) = -(y log p(x1, x2) + (1 - y) log(1 - p(x1, x2))) + λ‖w‖²;
where x1 and x2 are the two feature images of an image pair; p(x1, x2) is the similarity output by the twin network model; y is the label; λ‖w‖² is the L2 weight-decay term; λ is the weight-decay coefficient; and w denotes the weights of the sub-network;
minimize the loss so that the error between the output and the target output keeps shrinking until the twin network model converges; training finishes when the model reaches the set number of training epochs.
Furthermore, the sub-network is a convolutional neural network, and the twin network model comprises an input layer, 4 convolutional layers, 3 pooling layers, 3 fully connected layers and an output layer.
The input layer has 2 inputs. The 4 convolutional layers are the first, second, third and fourth convolutional layers; the numbers of convolution kernels are 32, 64 and 128, the kernels are 5×5, and the activation function is ReLU. The 3 pooling layers are the first, second and third pooling layers; all three use max pooling with a 2×2 window. The 3 fully connected layers are the first, second and third fully connected layers, with 4096, 2048 and 1 neurons, respectively.
The input layer, first convolutional layer, first pooling layer, second convolutional layer, second pooling layer, third convolutional layer, third pooling layer, fourth convolutional layer, the 3 fully connected layers and the output layer are connected in sequence. The fourth convolutional layer is followed by a fully connected layer of 4096 neurons with ReLU activation, then a fully connected layer of 2048 neurons with ReLU activation, which maps the two input feature images into two 2048-dimensional feature vectors h1 and h2. The absolute difference between h1 and h2 is the input of the third fully connected layer, whose output is converted into a probability by a sigmoid function, i.e. normalized to [0, 1].
Preferably, in step S3, samples under test are taken from the malware test set, and the trained twin network model is used to compute the similarity score between each sample under test and the malware training samples, as follows:
step 1, take a sample under test from the malware test set;
step 2, for that sample, use the trained twin network model to compute the mean similarity between the sample and all the malware training samples of each class in the training set;
step 3, take the maximum of these class means as the sample's similarity score;
step 4, repeat steps 1-3 until every sample in the malware test set has been processed, yielding a similarity score for each sample under test.
Preferably, a threshold is computed, and each sample under test is classified as belonging to a known malware family or a new malware family according to the threshold, as follows:
take a number of validation samples from the malware validation set, and use the trained twin network model to compute the similarity score of each validation sample against the malware training samples;
generate candidate scores between the lowest and highest similarity scores in fixed increments, and compute the F1 score on the validation set with each candidate used as a temporary threshold;
select the temporary threshold with the highest F1 score as the final threshold;
classify each sample under test according to the threshold, labeling it a new malware family when it does not belong to any known malware family in the training set;
the discrimination formula used is:
ND(X) = known family, if score > τ; new family, otherwise;
where X is the sample under test; ND is the new-malware-family detector; score is the similarity score; τ is the chosen threshold; known family denotes a known malware family; new family denotes a new malware family; and otherwise denotes score ≤ τ.
The second purpose of the invention is realized by the following technical scheme: a storage medium stores a program that, when executed by a processor, implements the malware family detection method according to the first object of the present invention.
The third purpose of the invention is realized by the following technical scheme: a computing device comprising a processor and a memory for storing processor-executable programs, the processor, when executing the programs stored in the memory, implementing the malware family detection method of the first object of the present invention.
Compared with the prior art, the invention has the following advantages and effects:
(1) In the malware family detection method of the invention, features are first extracted from all malware training samples of each class in the training set to obtain the corresponding feature vectors; the feature vectors are converted into feature images, image pairs are generated from them, and a twin network model is constructed and trained with those pairs; samples under test are then taken from the malware test set, and the trained model computes a similarity score for each sample against the training samples; finally, a threshold is computed, by which each sample under test is classified as a known or a new malware family. The detection method identifies the category of malware through three steps (feature extraction, twin network design and novelty measurement) and has high detection accuracy and a good classification effect; malware that does not belong to any known family in the training set is detected and labeled as a new malware family, which mitigates new threats to cyberspace security to some extent.
(2) By combining a twin network with feature images, the malware family detection method achieves higher precision, recall, F1 score and accuracy, and lower false-positive and false-negative rates.
(3) In the method, the features are not hand-crafted but are extracted automatically from the malware's run-time behavior; no samples distributed differently from the training samples need to be added during training, and no additional models need to be trained to extract new features, so the procedure is simple and has high popularization value.
Drawings
FIG. 1 is a flow chart of the malware family detection method of the present invention.
Fig. 2 is a flow chart of a feature vector generation process.
Fig. 3 is a flowchart of the feature image generation process.
Fig. 4 is a structural diagram of a twin network model.
FIG. 5 is a flow chart of a novelty measure.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Example 1
The embodiment discloses a malware family detection method, as shown in fig. 1, including the following steps:
S1, feature extraction: extract features from all malware training samples of each class in the malware training set to obtain the corresponding feature vectors; as shown in fig. 2, the process is as follows:
S11, preprocess the malware training samples: perform behavior analysis on each sample to generate a corresponding report file, extract all keywords from the report file, remove duplicates, and save the result as a text file. Each piece of malware corresponds to one text file.
In this embodiment, a sandbox is used to preprocess a malware training sample, specifically, the malware training sample is submitted to the sandbox to be run, and the sandbox generates a text file containing a behavior analysis report for each malware. The sandbox may be a Cuckoo sandbox, which is a special system environment that records the behavior of programs running therein, such as API function calls, parameters passed, files created or deleted, websites and ports accessed, etc.
In this embodiment, the extracted keywords are unigrams, and the report file is a JSON report file. All unigrams are extracted and de-duplicated; for example, given the fragment "api": "DeleteFileW", the extracted unigrams are "api": and "DeleteFileW". The JSON report file is saved as a .txt text file.
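The unigram extraction and de-duplication described above can be sketched as follows. This is a minimal illustration under the assumption that unigrams are whitespace-delimited tokens of the report text; `extract_unigrams` is a hypothetical helper name, not from the patent:

```python
def extract_unigrams(report_text):
    """Split a behavior report into whitespace-delimited unigrams and
    de-duplicate them, preserving first-seen order."""
    seen = []
    for token in report_text.split():
        if token not in seen:
            seen.append(token)
    return seen

# A fragment resembling a Cuckoo-style JSON report line:
tokens = extract_unigrams('"api": "DeleteFileW", "api": "CreateFileW",')
```

The de-duplicated token list would then be written to the per-sample text file.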
S12, traverse all the text files in which keywords are stored, build a dictionary from the keywords they contain, count the number of occurrences of each keyword, and delete from the dictionary the keywords whose occurrence count equals the number of samples, i.e. the keywords common to all samples, which carry no useful information (for example, JSON field names).
S13, sort the dictionary in descending order of keyword occurrence count and take the N keywords with the highest counts as the new dictionary. Since N is 20000 in this embodiment, the dictionary stores the top 20000 keywords across all malware samples.
S14, initialize an N-dimensional vector whose N dimensions correspond to N different keywords, traverse all the keyword text files again, and check whether each keyword appears in the new dictionary:
if so, set the corresponding dimension of the vector to 1; if not, set it to 0.
The resulting N-dimensional binary vector is taken as the feature vector.
S2, twin network design: convert the feature vectors into feature images, generate image pairs from the feature images, and construct and train a twin network model with the image pairs; the process is as follows:
S21, as shown in fig. 3, compute the pixel value of each bit in the feature vector: a bit value of 0 maps to pixel value 0 and a bit value of 1 maps to pixel value 255.
Convert the N-dimensional feature vector into an X×Y pixel matrix, where N = X·Y, X is the number of rows of the pixel matrix and Y is the number of columns. In this embodiment, the 20000-dimensional feature vector is converted into a 200×100 pixel matrix.
Convert the pixel matrix into a feature image.
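The bit-to-pixel mapping and reshaping can be written compactly with NumPy; `vector_to_image` is a hypothetical helper name for the step just described:

```python
import numpy as np

def vector_to_image(vec, rows, cols):
    """Map each bit to a grayscale pixel (0 -> 0, 1 -> 255) and reshape
    the N-dimensional vector into a rows x cols pixel matrix, N = rows*cols,
    as in the embodiment's 20000 -> 200 x 100 conversion."""
    vec = np.asarray(vec, dtype=np.uint8)
    assert vec.size == rows * cols, "vector length must equal rows*cols"
    return (vec * 255).reshape(rows, cols)
```

The resulting `uint8` matrix can be saved or fed to the network directly as a grayscale feature image.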
S22, pair the feature images two by two to form a number of image pairs, comprising similar image pairs and dissimilar image pairs. Similar pairs may also be called positive sample pairs, and dissimilar pairs negative sample pairs.
S23, construct the twin network model: select the sub-network type of the twin network and determine the parameter configuration of the model.
In this embodiment, the twin network model is structured as shown in fig. 4; the sub-network is a convolutional neural network (CNN), and the model comprises an input layer, 4 convolutional layers, 3 pooling layers, 3 fully connected layers and an output layer.
The parameter configuration is shown in table 1. The input layer has 2 inputs, x1 and x2. The 4 convolutional layers are the first, second, third and fourth convolutional layers; the numbers of convolution kernels are 32, 64 and 128, the kernels are 5×5, and the activation function is ReLU. The 3 pooling layers are the first, second and third pooling layers; all three use max pooling with a 2×2 window. The first fully connected layer has 4096 neurons, the second 2048, and the third 1.
The input layer, first convolutional layer, first pooling layer, second convolutional layer, second pooling layer, third convolutional layer, third pooling layer, fourth convolutional layer, the 3 fully connected layers and the output layer are connected in sequence. The fourth convolutional layer is followed by a fully connected layer of 4096 neurons with ReLU activation, then a fully connected layer of 2048 neurons with ReLU activation, which maps the two input feature images into two 2048-dimensional feature vectors h1 and h2. The absolute difference between h1 and h2 is the input of the third fully connected layer, whose output is converted into a probability by a sigmoid function, i.e. normalized to [0, 1].
TABLE 1
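The spatial sizes flowing through one branch of the twin network can be traced with a short sketch. The patent does not state padding or stride, so this assumes stride-1 "same" convolutions (5×5 convolutions preserve spatial size) and 2×2 max pooling with stride 2 (floor division on odd sizes); `siamese_branch_shapes` is a hypothetical helper name:

```python
def siamese_branch_shapes(h=200, w=100):
    """Trace feature-map height/width through the sub-network layer stack
    described above, for a 200 x 100 feature-image input."""
    shapes = [("input", h, w)]
    for name in ("conv1", "pool1", "conv2", "pool2",
                 "conv3", "pool3", "conv4"):
        if name.startswith("pool"):      # 2x2 max pool halves each side
            h, w = h // 2, w // 2
        shapes.append((name, h, w))      # 'same' conv keeps the size
    return shapes
```

Under these assumptions, the fourth convolutional layer outputs 25×12 feature maps, which are then flattened into the 4096-neuron fully connected layer.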
S24, train the twin network model with the image pairs as input; the model outputs the similarity of the two inputs.
S25, compute the loss function L(x1, x2, y), the binary cross-entropy between prediction and target:
L(x1, x2, y) = -(y log p(x1, x2) + (1 - y) log(1 - p(x1, x2))) + λ‖w‖²;
where x1 and x2 are the two feature images of an image pair; p(x1, x2) is the similarity output by the twin network model; y is the label; λ‖w‖² is the L2 weight-decay term; λ is the weight-decay coefficient; and w denotes the weights of the sub-network.
Minimize the loss so that the error between the output and the target output keeps shrinking until the twin network model converges; training finishes when the model reaches the set number of training epochs, here 20.
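The loss above can be sketched in NumPy. A minimal illustration, assuming a mean over the batch and an arbitrary weight-decay coefficient (the patent specifies neither); `siamese_loss` is a hypothetical name:

```python
import numpy as np

def siamese_loss(p, y, weights, lam=1e-4):
    """Regularized binary cross-entropy between the predicted similarity
    p(x1, x2) and the pair label y, plus the L2 weight-decay term
    lambda * ||w||^2 summed over the sub-network's weight arrays."""
    eps = 1e-12                          # guard against log(0)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    ce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    l2 = lam * sum(np.sum(w ** 2) for w in weights)
    return float(np.mean(ce) + l2)
```

For a perfectly similar pair (y = 1) predicted at p = 0.5 with no regularization, the loss is log 2, as the cross-entropy formula gives directly.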
S3, novelty measure: as shown in fig. 5, take samples under test from the malware test set and use the trained twin network model to compute the similarity score between each sample under test and the malware training samples;
then compute a threshold and, according to it, classify each sample under test as a known or a new malware family.
The similarity score is computed as follows:
step 1, take a sample under test from the malware test set;
step 2, for that sample, use the trained twin network model to compute the mean similarity between the sample and all the malware training samples of each class in the training set;
step 3, take the maximum of these class means as the sample's similarity score;
step 4, repeat steps 1-3 until every sample in the malware test set has been processed, yielding a similarity score for each sample under test.
Compute a threshold, and classify each sample under test as a known or a new malware family according to the threshold, as follows:
(1) Take a number of validation samples from the malware validation set, and use the trained twin network model to compute the similarity score of each validation sample against the malware training samples.
(2) Generate candidate scores between the lowest and highest similarity scores in fixed increments, and compute the F1 score on the validation set with each candidate used as a temporary threshold; the F1 score is computed with the usual F1 formula. The fixed increment in this embodiment is 0.1.
(3) Select the temporary threshold with the highest F1 score as the final threshold.
(4) Classify each sample under test according to the threshold, labeling it a new malware family when it does not belong to any known malware family in the training set.
The discrimination formula used is:
ND(X) = known family, if score > τ; new family, otherwise;
where X is the sample under test; ND (novelty detector) is the new-malware-family detector; score is the similarity score; τ is the chosen threshold; known family denotes a known malware family; new family denotes a new malware family; and otherwise denotes score ≤ τ.
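The threshold sweep and the discrimination rule can be sketched together. One assumption: F1 is computed here with "known family" (score > τ) as the positive class, which the patent does not state; `pick_threshold` and `novelty_detector` are hypothetical names:

```python
import numpy as np

def pick_threshold(scores, is_known, step=0.1):
    """Sweep candidate thresholds from the lowest to the highest validation
    similarity score in fixed increments and keep the candidate with the
    best F1 score over the validation set."""
    scores = np.asarray(scores, dtype=float)
    is_known = np.asarray(is_known, dtype=bool)
    best_tau, best_f1 = float(scores.min()), -1.0
    for tau in np.arange(scores.min(), scores.max() + step, step):
        pred = scores > tau                  # predicted "known family"
        tp = int(np.sum(pred & is_known))
        fp = int(np.sum(pred & ~is_known))
        fn = int(np.sum(~pred & is_known))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_tau, best_f1 = float(tau), f1
    return best_tau, best_f1

def novelty_detector(score, tau):
    """ND(X): a known family when score > tau, otherwise (score <= tau)
    a new malware family."""
    return "known family" if score > tau else "new family"
```

With the final τ chosen on the validation set, `novelty_detector` is applied to each test sample's similarity score.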
Example 2
The embodiment discloses a storage medium, which stores a program, and when the program is executed by a processor, the method for detecting a malware family according to embodiment 1 is implemented, specifically as follows:
S1, feature extraction: extract features from all the malware training samples of each class in the malware training set to obtain the corresponding feature vectors;
S2, twin network design: convert the feature vectors into feature images, generate image pairs from the feature images, and construct and train a twin network model with the image pairs;
S3, novelty measure: take samples under test from the malware test set and use the trained twin network model to compute the similarity score between each sample under test and the malware training samples;
then compute a threshold and, according to it, classify each sample under test as a known or a new malware family.
The storage medium in this embodiment may be a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), a USB flash drive, a removable hard disk, or another similar medium.
Example 3
This embodiment discloses a computing device comprising a processor and a memory storing a program executable by the processor; when the processor executes the program stored in the memory, the malware family detection method of Embodiment 1 is implemented, specifically as follows:
S1, feature extraction: extracting features from all the malware training samples of each class in the malware training set to obtain a plurality of corresponding feature vectors;
S2, twin network design: converting the feature vectors into feature images, generating image pairs from the feature images, constructing a twin network model, and training the model with the image pairs;
S3, novelty measure: taking samples to be tested from the malware test set, and calculating the similarity score of each sample to be tested against the malware training samples with the trained twin network model;
calculating a threshold, and discriminating each sample to be tested as a known malware family or a new malware family according to the threshold.
The computing device described in this embodiment may be a desktop computer, a notebook computer, a smartphone, a PDA handheld terminal, a tablet computer, or another terminal device with processing capability.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.
Claims (10)
1. A malware family detection method, comprising the steps of:
S1, feature extraction: extracting features from all the malware training samples of each class in the malware training set to obtain a plurality of corresponding feature vectors;
S2, twin network design: converting the feature vectors into feature images, generating image pairs from the feature images, constructing a twin network model, and training the model with the image pairs;
S3, novelty measure: taking samples to be tested from the malware test set, and calculating the similarity score of each sample to be tested against the malware training samples with the trained twin network model;
calculating a threshold, and discriminating each sample to be tested as a known malware family or a new malware family according to the threshold.
2. The malware family detection method of claim 1, wherein in step S1, feature extraction is performed on the malware training samples to obtain corresponding feature vectors, and the process is as follows:
preprocessing the malware training samples: performing behavior analysis on each malware training sample to generate a corresponding report file, extracting all keywords from the report file, removing duplicates, and saving them as a text file;
traversing all the text files in which keywords are stored, constructing a dictionary from the keywords in the text files, counting the number of occurrences of each keyword, and deleting from the dictionary the keywords whose number of occurrences equals the number of samples;
sorting the dictionary in descending order of the number of keyword occurrences, and taking the N keywords with the highest numbers of occurrences as a new dictionary;
initializing an N-dimensional vector whose N dimensions correspond to the N different keywords, traversing all the text files in which keywords are stored again, and judging whether each keyword appears in the new dictionary;
if so, setting the corresponding dimension of the vector to 1; if not, setting the corresponding dimension of the vector to 0;
the N-dimensional binary vector obtained after the traversal is taken as the feature vector.
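The dictionary construction and binary encoding of claim 2 can be sketched as follows. This is an illustrative sketch only; the names `build_dictionary` and `vectorize`, and the tie-breaking by alphabetical order among equally frequent keywords, are assumptions not stated in the patent.

```python
from collections import Counter


def build_dictionary(samples, n):
    """samples: list of keyword sets, one deduplicated set per report file.
    Keywords whose number of occurrences equals the sample count are deleted,
    then the n most frequent remaining keywords form the new dictionary."""
    counts = Counter()
    for keywords in samples:
        counts.update(keywords)
    # Drop keywords that appear in every sample (occurrences == sample count).
    counts = {k: c for k, c in counts.items() if c != len(samples)}
    # Descending order of occurrences; alphabetical tie-break (an assumption).
    ordered = sorted(counts, key=lambda k: (-counts[k], k))
    return ordered[:n]


def vectorize(keywords, dictionary):
    """N-dimensional binary feature vector: dimension i is 1 if the i-th
    dictionary keyword occurs in this sample's keyword set, else 0."""
    return [1 if k in keywords else 0 for k in dictionary]
```

For instance, a keyword occurring in all samples (such as a boilerplate report field) carries no discriminative information and is removed before the top-N cut.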
3. The malware family detection method of claim 2, wherein the malware training samples are preprocessed by a sandbox, specifically, the malware training samples are submitted to the sandbox for operation, and the sandbox generates a text file containing a behavior analysis report for each malware.
4. The malware family detection method of claim 2, wherein the extracted keywords are unigrams and the report file is a json report file.
5. The malware family detection method of claim 1, wherein in step S2, the feature vectors are respectively converted into feature images, image pairs are generated from the feature images, a twin network model is constructed, and the model is trained with the image pairs, as follows:
calculating the pixel value of each bit in the feature vector: mapping a bit value of 0 to a pixel value of 0, and a bit value of 1 to a pixel value of 255;
converting the N-dimensional feature vector into an X×Y pixel matrix, wherein N = X·Y, X is the number of rows of the pixel matrix, and Y is the number of columns of the pixel matrix;
converting the pixel matrix into a feature image;
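The bit-to-pixel mapping and reshaping described above can be sketched as follows; the function name `vector_to_image` and the row-major fill order are illustrative assumptions.

```python
def vector_to_image(bits, rows, cols):
    """Map an N-dimensional binary feature vector to a rows x cols grayscale
    pixel matrix, where N = rows * cols: bit 0 -> pixel 0, bit 1 -> pixel 255."""
    if len(bits) != rows * cols:
        raise ValueError("N must equal X * Y")
    pixels = [255 if b else 0 for b in bits]
    # Fill the matrix row by row (row-major order is an assumption).
    return [pixels[r * cols:(r + 1) * cols] for r in range(rows)]
```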
pairing the feature images pairwise to form a large number of image pairs, comprising similar image pairs and dissimilar image pairs;
constructing the twin network model: selecting the sub-network type of the twin network and determining the parameter configuration of the twin network model;
taking the image pairs as input to train the twin network model, which outputs the similarity of the two feature images;
calculating the loss function L(x1, x2, y), whose formula is as follows:
L(x1, x2, y) = -(y log p(x1, x2) + (1 - y) log(1 - p(x1, x2))) + λ||w||²;
wherein x1 and x2 are the two feature images of an image pair; p(x1, x2) is the similarity output by the twin network model; y is the label; λ||w||² is the L2 weight decay term; λ is the weight decay coefficient; and w are the weights of the sub-network;
minimizing the loss function so that the error between the output and the target output becomes smaller and smaller until the twin network model converges; when the set number of training epochs is reached, the training is finished.
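The loss of claim 5 is binary cross-entropy on the network's similarity output plus L2 weight decay. A numeric sketch, with illustrative helper names:

```python
import math


def siamese_loss(p, y, weights, weight_decay):
    """L(x1, x2, y) = -(y log p + (1 - y) log(1 - p)) + lambda * ||w||^2.

    p: similarity p(x1, x2) output by the model, in (0, 1);
    y: label, 1 for a similar pair and 0 for a dissimilar pair;
    weights: flat list of sub-network weights w;
    weight_decay: the coefficient lambda."""
    bce = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    l2 = weight_decay * sum(w * w for w in weights)
    return bce + l2
```

For a similar pair (y = 1) the cross-entropy term pushes p toward 1; for a dissimilar pair (y = 0) it pushes p toward 0; the decay term keeps the weights small.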
6. The malware family detection method of claim 5, wherein the sub-network is a convolutional neural network, and the twin network model comprises an input layer, 4 convolutional layers, 3 pooling layers, 3 fully-connected layers, and an output layer;
the input layer has 2 inputs; the 4 convolutional layers are a first, second, third, and fourth convolutional layer, the numbers of convolution kernels are 32, 64, and 128, the convolution kernels are of size 5×5, and the activation function is ReLU; the 3 pooling layers are a first, second, and third pooling layer, all using max pooling with a 2×2 window; the 3 fully-connected layers are a first, second, and third fully-connected layer with 4096, 2048, and 1 neurons respectively;
the input layer, the first convolutional layer, the first pooling layer, the second convolutional layer, the second pooling layer, the third convolutional layer, the third pooling layer, the fourth convolutional layer, the 3 fully-connected layers, and the output layer are connected in sequence; the fourth convolutional layer is followed by a fully-connected layer of 4096 neurons with ReLU activation, then a fully-connected layer of 2048 neurons with ReLU activation, mapping the two input feature images into two 2048-dimensional feature vectors h1 and h2; the absolute difference between h1 and h2 is taken as the input of the third fully-connected layer, which converts its output into a probability through a sigmoid function, i.e., normalizes the output to [0, 1].
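The comparison head of claim 6 can be sketched numerically as follows: the two embeddings are combined by element-wise absolute difference and mapped to a similarity probability by a single sigmoid unit. The function names and the concrete weight values in the example are illustrative assumptions, not the trained parameters.

```python
import math


def sigmoid(z):
    """Squash a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def similarity_head(h1, h2, weights, bias):
    """Final fully-connected layer of the twin network: element-wise |h1 - h2|
    followed by a 1-neuron linear layer and a sigmoid, giving p(x1, x2)."""
    diff = [abs(a - b) for a, b in zip(h1, h2)]
    z = sum(w * d for w, d in zip(weights, diff)) + bias
    return sigmoid(z)
```

With negative weights, identical embeddings (zero difference) yield sigmoid(bias), while larger differences push the similarity toward 0, which matches the intended behavior of the head.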
7. The malware family detection method of claim 1, wherein in step S3, the samples to be tested are taken from the malware test set, and the similarity score between each sample to be tested and the malware training samples is calculated with the trained twin network model, as follows:
step 1, taking a sample to be tested from the malware test set;
step 2, for each sample to be tested, calculating with the trained twin network model the mean similarity between the sample and all the malware training samples of each class in the malware training set;
step 3, taking the maximum of the mean similarities as the similarity score of the sample to be tested;
step 4, repeating steps 1-3 until all samples to be tested in the malware test set have been processed, obtaining a similarity score for each sample to be tested.
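The novelty measure of claim 7 can be sketched as follows. The `similarity` argument stands in for the trained twin network model; its name and the dict-based training-set layout are assumptions for the example.

```python
def similarity_score(test_sample, training_set, similarity):
    """training_set: dict mapping family name -> list of training samples.
    For each known family, average the similarity of the test sample to all
    of that family's training samples; the score is the maximum class mean."""
    means = []
    for family_samples in training_set.values():
        total = sum(similarity(test_sample, s) for s in family_samples)
        means.append(total / len(family_samples))
    return max(means)
```

A high score means the sample closely resembles at least one known family; a low score across all families is the signal that, after thresholding, marks it as a new malware family.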
8. The malware family detection method of claim 1, wherein a threshold is calculated and each sample to be tested is discriminated as a known malware family or a new malware family according to the threshold, specifically as follows:
taking a plurality of verification samples from the malware verification set, and calculating the similarity score of each verification sample against the malware training samples with the trained twin network model;
between the lowest and the highest similarity score, generating a sequence of candidate scores with a fixed value as the common difference, and using each candidate score in turn as a temporary threshold to calculate the F1 score on the corresponding verification set;
selecting the temporary threshold with the highest F1 score as the final threshold;
classifying each sample to be tested according to the threshold, and labeling a sample as belonging to a new malware family when it does not belong to any known malware family in the training set;
the discrimination formula used is specifically as follows:

ND(X) = known family, if score > τ; new family, otherwise

wherein X is the sample to be tested, and ND is the new malware family detector; score is the similarity score; τ is the selected threshold; known family denotes a known malware family; new family denotes a new malware family; and otherwise denotes score ≤ τ.
9. A storage medium storing a program, wherein the program, when executed by a processor, implements the malware family detection method of any one of claims 1 to 8.
10. A computing device comprising a processor and a memory for storing processor-executable programs, wherein the processor, when executing a program stored in the memory, implements the malware family detection method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911202586.9A CN111027069B (en) | 2019-11-29 | 2019-11-29 | Malicious software family detection method, storage medium and computing device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911202586.9A CN111027069B (en) | 2019-11-29 | 2019-11-29 | Malicious software family detection method, storage medium and computing device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111027069A true CN111027069A (en) | 2020-04-17 |
CN111027069B CN111027069B (en) | 2022-04-08 |
Family
ID=70203636
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911202586.9A Active CN111027069B (en) | 2019-11-29 | 2019-11-29 | Malicious software family detection method, storage medium and computing device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111027069B (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111783093A (en) * | 2020-06-28 | 2020-10-16 | 南京航空航天大学 | Malicious software classification and detection method based on soft dependence |
CN111984780A (en) * | 2020-09-11 | 2020-11-24 | 深圳市北科瑞声科技股份有限公司 | Multi-intention recognition model training method, multi-intention recognition method and related device |
CN112001424A (en) * | 2020-07-29 | 2020-11-27 | 暨南大学 | Malicious software open set family classification method and device based on countermeasure training |
CN112000954A (en) * | 2020-08-25 | 2020-11-27 | 莫毓昌 | Malicious software detection method based on feature sequence mining and simplification |
CN112329786A (en) * | 2020-12-02 | 2021-02-05 | 深圳大学 | Method, device and equipment for detecting copied image and storage medium |
CN112347479A (en) * | 2020-10-21 | 2021-02-09 | 北京天融信网络安全技术有限公司 | False alarm correction method, device, equipment and storage medium for malicious software detection |
CN112764791A (en) * | 2021-01-25 | 2021-05-07 | 济南大学 | Incremental updating malicious software detection method and system |
WO2021151343A1 (en) * | 2020-09-09 | 2021-08-05 | 平安科技(深圳)有限公司 | Test sample category determination method and apparatus for siamese network, and terminal device |
CN113392399A (en) * | 2021-06-23 | 2021-09-14 | 绿盟科技集团股份有限公司 | Malicious software classification method, device, equipment and medium |
CN114139153A (en) * | 2021-11-02 | 2022-03-04 | 武汉大学 | Graph representation learning-based malware interpretability classification method |
CN114611102A (en) * | 2022-02-23 | 2022-06-10 | 西安电子科技大学 | Visual malicious software detection and classification method and system, storage medium and terminal |
WO2022171067A1 (en) * | 2021-02-09 | 2022-08-18 | 北京有竹居网络技术有限公司 | Video processing method and apparatus, and storage medium and device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106803039A (en) * | 2016-12-30 | 2017-06-06 | 北京神州绿盟信息安全科技股份有限公司 | The homologous decision method and device of a kind of malicious file |
CN108256325A (en) * | 2016-12-29 | 2018-07-06 | 中移(苏州)软件技术有限公司 | A kind of method and apparatus of the detection of malicious code mutation |
CN109145605A (en) * | 2018-08-23 | 2019-01-04 | 北京理工大学 | A kind of Android malware family clustering method based on SinglePass algorithm |
CN109670304A (en) * | 2017-10-13 | 2019-04-23 | 北京安天网络安全技术有限公司 | Recognition methods, device and the electronic equipment of malicious code family attribute |
US10491627B1 (en) * | 2016-09-29 | 2019-11-26 | Fireeye, Inc. | Advanced malware detection using similarity analysis |
2019-11-29 CN CN201911202586.9A patent/CN111027069B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10491627B1 (en) * | 2016-09-29 | 2019-11-26 | Fireeye, Inc. | Advanced malware detection using similarity analysis |
CN108256325A (en) * | 2016-12-29 | 2018-07-06 | 中移(苏州)软件技术有限公司 | A kind of method and apparatus of the detection of malicious code mutation |
CN106803039A (en) * | 2016-12-30 | 2017-06-06 | 北京神州绿盟信息安全科技股份有限公司 | The homologous decision method and device of a kind of malicious file |
CN109670304A (en) * | 2017-10-13 | 2019-04-23 | 北京安天网络安全技术有限公司 | Recognition methods, device and the electronic equipment of malicious code family attribute |
CN109145605A (en) * | 2018-08-23 | 2019-01-04 | 北京理工大学 | A kind of Android malware family clustering method based on SinglePass algorithm |
Non-Patent Citations (2)
Title |
---|
Cordonsky, I. et al.: "DeepOrigin: End-to-End Deep Learning for Detection of New Malware Families", 2018 International Joint Conference on Neural Networks (IJCNN) |
Shen Yan et al.: "Classifier Based on Improved Deep Siamese Network and Its Application", Computer Engineering and Applications |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111783093A (en) * | 2020-06-28 | 2020-10-16 | 南京航空航天大学 | Malicious software classification and detection method based on soft dependence |
CN112001424B (en) * | 2020-07-29 | 2023-05-23 | 暨南大学 | Malicious software open set family classification method and device based on countermeasure training |
CN112001424A (en) * | 2020-07-29 | 2020-11-27 | 暨南大学 | Malicious software open set family classification method and device based on countermeasure training |
CN112000954A (en) * | 2020-08-25 | 2020-11-27 | 莫毓昌 | Malicious software detection method based on feature sequence mining and simplification |
CN112000954B (en) * | 2020-08-25 | 2024-01-30 | 华侨大学 | Malicious software detection method based on feature sequence mining and simplification |
WO2021151343A1 (en) * | 2020-09-09 | 2021-08-05 | 平安科技(深圳)有限公司 | Test sample category determination method and apparatus for siamese network, and terminal device |
CN111984780A (en) * | 2020-09-11 | 2020-11-24 | 深圳市北科瑞声科技股份有限公司 | Multi-intention recognition model training method, multi-intention recognition method and related device |
CN112347479A (en) * | 2020-10-21 | 2021-02-09 | 北京天融信网络安全技术有限公司 | False alarm correction method, device, equipment and storage medium for malicious software detection |
CN112347479B (en) * | 2020-10-21 | 2021-08-24 | 北京天融信网络安全技术有限公司 | False alarm correction method, device, equipment and storage medium for malicious software detection |
CN112329786A (en) * | 2020-12-02 | 2021-02-05 | 深圳大学 | Method, device and equipment for detecting copied image and storage medium |
CN112329786B (en) * | 2020-12-02 | 2023-06-16 | 深圳大学 | Method, device, equipment and storage medium for detecting flip image |
CN112764791B (en) * | 2021-01-25 | 2023-08-08 | 济南大学 | Incremental update malicious software detection method and system |
CN112764791A (en) * | 2021-01-25 | 2021-05-07 | 济南大学 | Incremental updating malicious software detection method and system |
WO2022171067A1 (en) * | 2021-02-09 | 2022-08-18 | 北京有竹居网络技术有限公司 | Video processing method and apparatus, and storage medium and device |
CN113392399A (en) * | 2021-06-23 | 2021-09-14 | 绿盟科技集团股份有限公司 | Malicious software classification method, device, equipment and medium |
CN114139153A (en) * | 2021-11-02 | 2022-03-04 | 武汉大学 | Graph representation learning-based malware interpretability classification method |
CN114611102A (en) * | 2022-02-23 | 2022-06-10 | 西安电子科技大学 | Visual malicious software detection and classification method and system, storage medium and terminal |
Also Published As
Publication number | Publication date |
---|---|
CN111027069B (en) | 2022-04-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111027069B (en) | Malicious software family detection method, storage medium and computing device | |
CN110826059B (en) | Method and device for defending black box attack facing malicious software image format detection model | |
Chawla et al. | Host based intrusion detection system with combined CNN/RNN model | |
TWI673625B (en) | Uniform resource locator (URL) attack detection method, device and electronic device | |
CN110704840A (en) | Convolutional neural network CNN-based malicious software detection method | |
Vinayakumar et al. | Evaluating deep learning approaches to characterize and classify the DGAs at scale | |
CN109302410B (en) | Method and system for detecting abnormal behavior of internal user and computer storage medium | |
CN110135157B (en) | Malicious software homology analysis method and system, electronic device and storage medium | |
CN112866023B (en) | Network detection method, model training method, device, equipment and storage medium | |
CN112771523A (en) | System and method for detecting a generated domain | |
CN107944273B (en) | TF-IDF algorithm and SVDD algorithm-based malicious PDF document detection method | |
Yang et al. | Detecting stealthy domain generation algorithms using heterogeneous deep neural network framework | |
CN113221112B (en) | Malicious behavior identification method, system and medium based on weak correlation integration strategy | |
Hendler et al. | Amsi-based detection of malicious powershell code using contextual embeddings | |
CN111382438A (en) | Malicious software detection method based on multi-scale convolutional neural network | |
Alazab et al. | Detecting malicious behaviour using supervised learning algorithms of the function calls | |
CN111400713B (en) | Malicious software population classification method based on operation code adjacency graph characteristics | |
Kakisim et al. | Sequential opcode embedding-based malware detection method | |
CN111967503A (en) | Method for constructing multi-type abnormal webpage classification model and abnormal webpage detection method | |
CN108959930A (en) | Malice PDF detection method, system, data storage device and detection program | |
Rubin et al. | Amsi-based detection of malicious powershell code using contextual embeddings | |
Zhu et al. | Effective phishing website detection based on improved BP neural network and dual feature evaluation | |
Al Ogaili et al. | Malware cyberattacks detection using a novel feature selection method based on a modified whale optimization algorithm | |
CN113762294B (en) | Feature vector dimension compression method, device, equipment and medium | |
Maulana et al. | Malware classification based on system call sequences using deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||