CN115098857A - Visual malicious software classification method and device - Google Patents
Visual malicious software classification method and device
- Publication number
- CN115098857A (application CN202210672136.1A)
- Authority
- CN
- China
- Prior art keywords
- classification
- data set
- image
- basic block
- malware
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/561—Virus type analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9024—Graphs; Linked lists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/53—Decompilation; Disassembly
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention discloses a method and a device for visualized malware classification. The method comprises the following steps: preprocessing a malware data set to form a training data set; processing the training data set again with the random walk of a graph embedding model to generate basic block sequences; using a model that learns assembly code semantics and the control flow graph to generate function vectors composed of the basic block sequences; fully splicing the function vectors to obtain an effective feature information image of each malicious sample; and extracting feature vectors from the effective feature information image with a spatial pyramid pooling layer, followed by a softmax layer that classifies the feature image, thereby classifying malware families. The device comprises a processor and a memory. The invention uses SPP-Net (spatial pyramid pooling network) for the final classification of the feature images, thereby improving malware detection and helping to maintain the security of the network environment.
Description
Technical Field
The invention relates to the field of network security, in particular to a method and a device for classifying visual malicious software.
Background
The word "malware" is a combination of "malicious" and "software". According to the definition published by the Internet Society of China, malware is software that is installed and run on a user's computer or other terminal without explicitly prompting the user or without the user's authorization, and that infringes the user's legitimate rights and interests, but that does not include computer viruses as defined by national laws and regulations.
With the rapid development of the Internet, network security problems have become increasingly prominent. In the malware field in particular, the explosive growth of ransomware has caused huge economic losses to users and enterprises that rely on computer systems. According to SonicWall data, attacks such as Internet of Things malware grew by 6% to 19%, whereas ransomware attacks grew exponentially: their number rose by about 232% compared with 2019 and by nearly 319 million compared with 2020.
With the rise of emerging technologies such as cloud computing, the Internet of Things, big data and 5G, malware has become more diverse, more sophisticated (using techniques such as encryption, metamorphism and self-protection) and more numerous, which poses a huge challenge to anti-malware technology. Faster, more accurate anti-malware technologies capable of continuous tracking are therefore needed.
One of the major challenges faced by anti-malware systems is the need to evaluate the potential malicious intent and execution behavior of large amounts of data and files. A main reason for the large number of distinct files is that malware developers introduce polymorphism into the malicious components to evade detection. Malicious files belonging to the same malware "family" therefore exhibit the same form of malicious behavior, but are continually modified or obfuscated using various strategies so that they look like many different files.
The "family" characteristic can thus be used to classify malware and to better understand how malware infects computers and devices, the level of threat it poses, and the corresponding precautions.
Complete data sets for training malware classification models have also been proposed in prior work. One example, used below to explain the present work, is the data set of the malware family classification challenge published by Microsoft on the Kaggle website in 2015. The data set contains 10868 malware samples from nine families. The PE (portable executable) header of each file has been removed by preprocessing, and for each sample the disassembly result (.asm file) and the binary representation (.byte file) are provided, together with a label indicating the family to which the sample belongs.
Based on this data set, a number of methods for malware family classification have also been created. One example is the method of converting the binary file into an image. It was first proposed by Nataraj, Karthikeyan et al. in 2011: a binary file is displayed as a grayscale image, and malicious files are classified using the GIST (global feature) texture features of the image and the KNN (k-nearest neighbor) algorithm. Many researchers have since extended this idea. In China, a typical example is the work of Han et al., which extracts features from the grayscale image with a gray-level co-occurrence matrix algorithm, selects the features with the largest weights using a feature selection algorithm, and builds a texture feature library after normalization. Another example is the assembly code detection method. In 2018, Guosong Sun et al. vectorized OpCodes, merged static analysis and assembly code (extracted from unlabeled data) with temporal features generated by an RNN (recurrent neural network), processed the merged codes with minhash to generate feature images, and finally trained a CNN (convolutional neural network) on the feature images for classification.
The assembly code detection method already uses assembly operation codes for malware detection and classification, but it relies only on the order in which the operation codes appear and therefore loses the semantic relations between them.
Disclosure of Invention
The invention provides a method and a device for visualized malware classification that identify the family of a malware sample accurately and quickly. The invention applies disassembly technology and deep learning algorithms to malware analysis and provides a complete malware family classification workflow. The graph embedding walk Node2vecWalk and an assembly language model are used to integrate the sequence semantic information and the structural information of the disassembled PE file into a feature image of the file, and SPP-Net (spatial pyramid pooling network) performs the final classification, thereby improving malware detection and helping to maintain the security of the network environment. The details are as follows:
In a first aspect, a method for visualized malware classification comprises the following steps:
preprocessing a malware data set to form a training data set;
processing the training data set again with the random walk of the graph embedding model to generate basic block sequences;
using a model that learns assembly code semantics and the control flow graph to generate function vectors composed of the basic block sequences;
fully splicing the function vectors to obtain an effective feature information image of a malicious sample; and extracting the feature vectors of the effective feature information image with a spatial pyramid pooling layer, followed by a softmax layer that classifies the feature image, thereby classifying malware families.
The preprocessing comprises: generating disassembled files, extracting subroutines, converting the control structure, and normalization.
Further, processing the training data set again with the random walk of the graph embedding model to generate the basic block sequences specifically comprises:
regarding each basic block as a graph node and each jump relation as an edge of the graph, computing the most probable next node of the current node by random walk according to the transition probabilities, and sampling accordingly;
finally, obtaining a group of binary execution paths covering all the basic blocks, and aggregating all basic block sequences generated by the random walk into the total function sequence used for training.
Using the model that learns assembly code semantics and the control flow graph to generate function vectors composed of the basic block sequences comprises:
using the neighboring instructions of the current instruction to capture the semantic relations between operation codes and operands; using the function vector to memorize the information that cannot be predicted from the context, and finally training an embedded function vector; all preprocessed assembly code instructions participate in the training.
Further, extracting the feature vectors of the effective feature information image with the spatial pyramid pooling layer and classifying the feature image with the connected softmax layer specifically comprises:
a first pyramid layer, which divides the whole image into 16 blocks; for an image of size (w, h), each extracted image block has size (w/4, h/4);
a second pyramid layer, which divides the whole image into 4 blocks; for an image of size (w, h), each extracted image block has size (w/2, h/2);
a third pyramid layer, which treats the whole image as a single block; for an image of size (w, h), the extracted image block has size (w, h).
In a second aspect, a device for visualized malware classification comprises a processor and a memory, the memory storing program instructions, and the processor calling the program instructions stored in the memory to cause the device to perform the method steps of the first aspect.
The technical solution provided by the invention has the following beneficial effects:
1. The invention integrates assembly operation codes, operand semantics and program logic structure information, and combines them with computer vision processing techniques to classify malware families more accurately and quickly;
2. The method combines the semantic features of the assembly code sequence, which is the representation closest to the source code, with the structural features of the control flow graph as the basis for malware family classification. The assembly code semantics represent the behavior of the code blocks, and the control flow graph represents the execution order of the program. Together these two kinds of information form the operating logic of the program, and analyzing both represents the runtime behavior of the program in a binary file more accurately, which improves prediction accuracy. When tested on the data set mentioned above, the invention achieves higher detection accuracy than other methods;
3. The method processes code sequences with an assembly language model. Assembly functions may look different syntactically yet share similar functional logic in the source code; the method captures this shared logic to learn richer and more general semantic information among assembly instructions, and these semantic features reflect the program meaning of the assembly code. For the assembly language model, the size of the training set determines the accuracy of the semantic information, that is, how accurately the vector representation of any assembly code sequence can be produced. On the data set mentioned above, the model attains a prediction accuracy on unseen data that is very close to its accuracy on the training set, which shows that the method generalizes well to new data;
4. Fusing sequence semantic information with structural information gives the whole scheme better interpretability and a more intuitive direction for optimization. As mentioned above, most current schemes treat the classification of malicious files as a black-box process and do not consider the internal logic of the program represented by the binary file, so it is difficult to explain why a good result is obtained. The invention starts from the execution process of the binary file and distinguishes samples by the characteristics of the program's execution logic, so that a practitioner can understand more intuitively why the invention works, which also helps its subsequent optimization;
5. The invention benefits from the rapid development of computer vision technology. Classifying and visualizing malware families further improves classification accuracy and enables non-professionals to understand and distinguish different malware, which in turn improves malware detection and helps to maintain the security of the network environment.
Drawings
FIG. 1 is a schematic diagram of a node walk network graph structure;
FIG. 2 is a flow chart of a method of classifying visual malware;
FIG. 3 is a control flow graph of a function;
FIG. 4 is a schematic illustration of a control flow graph serialization and normalization process;
FIG. 5 is a schematic diagram of function vector generation;
FIG. 6 is a schematic diagram of training the spatial pyramid model.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
The embodiments of the invention involve the following technologies:
I. Static program analysis
Static analysis covers a series of operations on a binary file: identifying its file type and programming language, determining whether it is encrypted or packed, and obtaining its assembly code. Of these operations, obtaining the assembly code is the most critical. The assembly code reflects the execution logic of the program to a certain extent and plays a vital role in analyzing malicious code. The professional disassembly tool IDA Pro can be used to obtain the assembly code of a binary file together with the control flow graph (CFG) corresponding to each assembly function. The CFG is used in the first step of the classification model.
II. Random walk Node2vecWalk of the graph embedding model
Node2vecWalk is a graph embedding method that takes both the DFS (depth-first search) neighborhood and the BFS (breadth-first search) neighborhood into account. BFS reaches the immediate neighbors of each node and emphasizes the local, microscopic view; DFS explores the network structure around a node in depth and emphasizes the global, macroscopic view. The embodiment of the invention uses the first two steps of the algorithm: a. preprocessing and computing the transition probabilities; b. biased second-order random walk.
With these two steps, sequences of graph nodes can be sampled efficiently and quickly, and the resulting sequence set is used by the model to generate function vectors.
The underlying formulas of steps a and b are as follows:
$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z}, & (v, x) \in E \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
where P is the probability that the walk moves from node v to node x, $c_i$ is the i-th node of the walk, E is the edge set of the graph, $\pi_{vx}$ is the unnormalized transition probability, and Z is the normalization constant.
$$\pi_{vx} = \alpha_{pq}(t, x) \cdot \omega_{vx} \tag{2}$$
where $\omega_{vx}$ is the weight of the edge between nodes v and x; if the edges are weighted, the unnormalized transition probability $\pi_{vx}$ is updated accordingly.
$$\alpha_{pq}(t, x) = \begin{cases} \dfrac{1}{p}, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ \dfrac{1}{q}, & d_{tx} = 2 \end{cases} \tag{3}$$
where $\alpha_{pq}(t, x)$ describes the relation between a neighbor x of v and the previous node t. $d_{tx} = 0$ means the walk returns from v to the previous node t; $d_{tx} = 1$ means the neighbor x of v also has an edge to t; $d_{tx} = 2$ means the neighbor x of v has no edge to t. The parameter p controls the probability of returning from v to t during sampling and is therefore the return parameter: the larger p is, the smaller the return probability. The parameter q controls the probability of moving from t through v to a new node: if q > 1, the walk is biased towards local nodes (BFS) because 1 > 1/q; if q < 1, the walk is biased towards distant nodes (DFS) because 1 < 1/q. The resulting node walk network graph structure is shown in FIG. 1.
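As an illustration of the two steps above, the following is a minimal sketch of a biased second-order walk over a small graph. The graph representation (a dictionary mapping each node to its neighbors and edge weights), the function names `alpha` and `node2vec_walk`, and the use of a direct edge test in place of the shortest-path distance d_tx are assumptions made for this example, not details taken from the patent.

```python
import random

def alpha(p, q, t, x, graph):
    """Search bias alpha_pq(t, x) after the walk moved t -> v and now considers x."""
    if x == t:              # d_tx = 0: step back to the previous node
        return 1.0 / p
    if x in graph[t]:       # d_tx = 1: x is also adjacent to t
        return 1.0
    return 1.0 / q          # d_tx = 2: x leads away from t

def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Sample one walk of at most `length` nodes.

    graph: dict node -> dict neighbor -> edge weight (hypothetical layout).
    """
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        v = walk[-1]
        neighbors = sorted(graph[v])
        if not neighbors:   # dead end, e.g. a block ending in a return
            break
        if len(walk) == 1:  # no previous node yet: use plain edge weights
            weights = [graph[v][x] for x in neighbors]
        else:
            t = walk[-2]
            # unnormalized transition probability pi_vx = alpha_pq(t, x) * w_vx
            weights = [alpha(p, q, t, x, graph) * graph[v][x] for x in neighbors]
        # random.choices normalizes the weights, which plays the role of Z
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

if __name__ == "__main__":
    cfg = {"A": {"B": 1.0}, "B": {"A": 1.0, "C": 1.0}, "C": {"B": 1.0}}
    print(node2vec_walk(cfg, "A", 6, p=2.0, q=0.5))
```

Running several walks from each node and collecting them would give the sequence set that is later used to train the function vectors.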
III. Assembly language model Asm2vec
The assembly language model extends the natural language processing model PV-DM. PV-DM is similar to CBOW in word2vec: a missing word is predicted from its context. The word vectors of the context and the current paragraph vector are fused, a softmax layer produces a prediction over the words of the vocabulary, and the loss computed against the label is backpropagated to update the parameters. In the end, both word vectors and paragraph vectors are obtained. In Asm2vec, the feature vector is generated from the assembly code sequence, the assembly function itself and a sliding window: the model learns the semantics between opcodes and operands and memorizes, in the function vector, a weighted mixture of the contexts represented by the assembly function.
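The model involves the following loss function:
$$\sum_{f_s}^{RP} \sum_{seq_i}^{S(f_s)} \sum_{in_j}^{I(seq_i)} \sum_{t_c}^{T(in_j)} \log P(t_c \mid f_s, in_{j-1}, in_{j+1})$$
wherein,
| Symbol | Meaning |
| --- | --- |
| $\sum$ | Summation |
| $RP$ | Function set |
| $f_s$ | Any one function |
| $S(f_s)$ | Sequence set of the function |
| $seq_i$ | Any one of the sequences |
| $I(seq_i)$ | Instruction set of the sequence |
| $in_j$ | One of the instructions |
| $T(in_j)$ | Concatenation of the operation codes and operands of the instruction |
| $t_c$ | One token (operation code or operand) of $T(in_j)$ |
Here $in_{j-1}$ and $in_{j+1}$ are the instructions immediately before and after $in_j$ in the sequence.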
IV. Deep learning model SPP-Net
SPP-Net (spatial pyramid pooling network) is a neural network model for image classification. Its key property is that a feature map of any size can be converted into a feature vector of fixed size, extracted from multi-scale features; this is the purpose of spatial pyramid pooling. The fixed-size vector is then fed to a fully connected layer for classification. The overall pipeline is roughly: input image, feature extraction by convolutional layers, extraction of fixed-size features by spatial pyramid pooling, and fully connected layers.
The basic principle is as follows: the feature map is divided into 21 blocks (16 + 4 + 1), and the maximum value of each block is computed to give one output neuron. In this way an image of arbitrary size is converted into a fixed-size 21-dimensional feature.
The kernel size, stride and padding of each pyramid level are computed from the input size using ceil() and floor(), which denote rounding up and rounding down, respectively.
wherein,
| Symbol | Meaning |
| --- | --- |
| $h_{in}$ | Height of the input |
| $w_{in}$ | Width of the input |
| $K_h$ | Height of the kernel |
| $S_h$ | Stride in the height direction |
| $P_h$ | Padding in the height direction (applied on both sides, so it counts twice) |
| $K_w$ | Width of the kernel |
| $S_w$ | Stride in the width direction |
| $P_w$ | Padding in the width direction (applied on both sides, so it counts twice) |
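The conversion from an arbitrary-size map to a 21-dimensional feature can be sketched as follows. This is not the kernel, stride and padding formula of the patent: as a stated assumption, the sketch uses adaptive bin boundaries (bin i of n covers rows floor(i·h/n) up to ceil((i+1)·h/n)), which needs no padding and gives the same number of outputs. The function name `spp_features` and the list-of-rows input format are hypothetical.

```python
import math

def spp_features(feature_map, levels=(4, 2, 1)):
    """Max-pool a 2-D map (list of rows) on an n x n grid for each level.

    With levels (4, 2, 1) the result always has 16 + 4 + 1 = 21 values,
    whatever the height and width of the input.
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # row range of bin i (assumed adaptive boundaries, no padding)
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled

if __name__ == "__main__":
    small = [[float(7 * r + c) for c in range(7)] for r in range(5)]
    large = [[float(r * c) for c in range(40)] for r in range(30)]
    print(len(spp_features(small)), len(spp_features(large)))
```

In a full network this pooling would be applied to every channel of the last convolutional feature map, and the concatenated result would feed the fully connected and softmax layers.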
Example 1
The malware family classification method based on the disassembled code structure and semantic features comprises the following steps:
101: A disassembly tool is used to analyze the binary file and obtain its assembly code representation, which is stored as an .asm file. The .asm file is then parsed, basic blocks are delimited by jump instructions and end instructions, and the corresponding control flow graph is generated according to the logical structure. Each node of the control flow graph consists of one or more assembly statements, and each edge represents a jump relation between nodes;
102: The Node2vecWalk model is used to walk the control flow graph of each function and generate a plurality of biased execution path sequences, after which the operands in the path sequences are normalized by replacing registers, memory addresses and numerical values;
because operands carry little weight in the semantic relations, this replacement simplifies the processing performed by the assembly language model Asm2vec.
103: The resulting function sequences are vectorized with Asm2vec, the feature vectors are spliced to obtain a feature image, and the feature image is finally classified with the deep learning algorithm SPP-Net, thereby classifying the malware.
In summary, the embodiments of the invention use the vector representation of the control flow graph and semantic information obtained in the above steps to classify binary files with an image classification technique, and thereby determine the family of the malware.
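Step 101 above can be illustrated with a small sketch that splits the instructions of one function into basic blocks and derives control flow edges. The input format (tuples of label, mnemonic and operand), the abbreviated mnemonic sets `JUMPS` and `ENDS`, and the function names are assumptions for this example; a real .asm file produced by a disassembler needs a proper parser first.

```python
# Abbreviated, illustrative mnemonic sets (not a complete x86 list).
JUMPS = {"jmp", "je", "jne", "jz", "jnz", "ja", "jb", "jg", "jl", "jge", "jle"}
ENDS = {"ret", "retn", "hlt"}

def split_basic_blocks(instructions):
    """Split (label_or_None, mnemonic, operand) tuples into basic blocks."""
    blocks, current = [], []
    for label, mnemonic, operand in instructions:
        if label is not None and current:   # a labelled target opens a new block
            blocks.append(current)
            current = []
        current.append((label, mnemonic, operand))
        if mnemonic in JUMPS or mnemonic in ENDS:   # jump or end closes the block
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

def build_cfg(blocks):
    """Return dict block index -> set of successor block indices."""
    label_to_block = {b[0][0]: i for i, b in enumerate(blocks) if b[0][0] is not None}
    edges = {i: set() for i in range(len(blocks))}
    for i, block in enumerate(blocks):
        _, mnemonic, operand = block[-1]
        if mnemonic in JUMPS and operand in label_to_block:
            edges[i].add(label_to_block[operand])       # taken branch
        falls_through = mnemonic != "jmp" and mnemonic not in ENDS
        if falls_through and i + 1 < len(blocks):
            edges[i].add(i + 1)                          # next block in file order
    return edges

if __name__ == "__main__":
    code = [
        (None, "cmp", "eax, 0"), (None, "je", "loc_2"),
        (None, "mov", "ebx, 1"), (None, "jmp", "loc_3"),
        ("loc_2", "mov", "ebx, 2"), ("loc_3", "retn", ""),
    ]
    print(build_cfg(split_basic_blocks(code)))
```

The resulting edge dictionary has the node and edge structure that a biased random walk in step 102 would traverse.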
Example 2
Referring to FIG. 2, an embodiment of the invention provides a malware family classification method based on the disassembled code structure and semantic features. Using this method to classify malware families improves classification accuracy, raises the malware detection rate and helps to maintain the security of the network environment. The embodiment is implemented as follows:
step 201: preprocessing a malware data set, which comprises generating disassembled files, extracting subroutines, converting the control structure, normalization and the like, to form a training data set;
step 202: further processing the training data set obtained in step 201 with the random walk Node2vecWalk of the graph embedding model, following the walk strategy of FIG. 1, to generate basic block sequences;
step 203: for the basic block sequences generated in step 202, using Asm2vec, a model that learns assembly code semantics and the control flow graph, to generate function vectors composed of the basic block sequences;
step 204: fully splicing the function vectors of step 203 to obtain an effective feature information image of the malicious sample;
step 205: for the effective feature information image obtained in step 204, extracting the feature vector of the image with a spatial pyramid pooling (SPP) layer, followed by a softmax layer that classifies the feature image, thereby classifying the malware family.
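The splicing of step 204 is not specified in detail in this text. One possible reading, sketched below under that assumption, stacks the function vectors of a sample as the rows of a matrix and rescales the values to 8-bit grey levels. The function name `vectors_to_image` and the min-max scaling are illustrative choices, not the patent's definition.

```python
def vectors_to_image(function_vectors):
    """Stack function vectors as rows and map all values to grey levels 0..255.

    function_vectors: list of equal-length lists of floats, one per function.
    The image height equals the number of functions, so it varies per sample.
    """
    lo = min(min(v) for v in function_vectors)
    hi = max(max(v) for v in function_vectors)
    scale = 255.0 / (hi - lo) if hi > lo else 0.0   # constant input -> all zeros
    return [[int(round((x - lo) * scale)) for x in v] for v in function_vectors]

if __name__ == "__main__":
    vectors = [[-0.8, 0.1, 0.4], [0.9, -0.2, 0.0]]
    for row in vectors_to_image(vectors):
        print(row)
```

Because samples contain different numbers of functions, images built this way differ in height, which is consistent with using a spatial pyramid pooling layer that accepts inputs of arbitrary size in step 205.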
The data collection and preprocessing stage is as follows:
In the field of malware families, there is no strict standard for naming and classifying families. Security vendors name a family according to their own conventions when they discover a new malicious sample, and the name gradually becomes accepted by default. In an actual application scenario, the concrete problem should therefore be considered, adjustments made accordingly, and an appropriate data set selected for classification prediction.
For these reasons, the embodiments of the invention use a recognized data set as well as a data set from the authoritative source VirusTotal for experimental evaluation. The recognized data set is an annotated data set from Microsoft that provides two files for each sample, an .asm file (disassembled file) and a .byte file (binary file), together with the family label of each sample. This family label is used for the later family classification.
The method operates on .asm files but is not limited to this data set (it also applies, for example, to the latest malicious sample data set provided by VirusTotal). Since the data set provided by Microsoft already contains .asm files, only subroutine extraction, control structure conversion, normalization and similar steps are needed. The data set provided by VirusTotal does not directly include the .asm files of the software, so the binary files are disassembled with tools such as IDA Pro to generate the corresponding .asm files; the subsequent assembly code preprocessing is the same as above.
New malicious samples to be classified must first be processed with a disassembly tool.
Subroutine extraction captures all code blocks delimited by SUBROUTINE in the .asm file; control structure conversion divides the captured code blocks at instructions such as jump and end; normalization replaces registers, memory addresses and numerical values.
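The normalization described above can be sketched with regular expressions. The placeholder tokens `REG`, `MEM` and `IMM`, the abbreviated register list and the function name `normalize_operand` are assumptions for illustration; the text only states that registers, memory addresses and numerical values are replaced.

```python
import re

# Abbreviated 32-bit, 16-bit and 8-bit x86 register names (illustrative only).
REGISTERS = r"\b(?:e?(?:ax|bx|cx|dx|si|di|sp|bp)|[abcd][lh])\b"
NUMBERS = r"\b(?:0x[0-9a-f]+|[0-9][0-9a-f]*h|\d+)\b"

def normalize_operand(operand):
    """Replace memory references, registers and numeric values by placeholders."""
    operand = re.sub(r"\[[^\]]*\]", "MEM", operand)            # e.g. [ebp+8]
    operand = re.sub(REGISTERS, "REG", operand, flags=re.I)     # e.g. eax, al
    operand = re.sub(NUMBERS, "IMM", operand, flags=re.I)       # e.g. 10, 0FFh
    return operand

if __name__ == "__main__":
    for op in ("eax, [ebp+8]", "ecx, 0FFh", "al, 10", "loc_401000"):
        print(op, "->", normalize_operand(op))
```

Memory references are replaced first so that registers and offsets inside the brackets do not produce separate tokens. Labels such as `loc_401000` are left unchanged because the digits are not a standalone word.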
Wherein, generating a basic block sequence:
the control flow graph (CFG) extracted in step 201 is processed by Node2vecWalk to generate basic block sequences. Fig. 4 shows one such basic block sequence. In this step, basic blocks are regarded as graph nodes and jump relationships are regarded as edges of the graph. Node2vecWalk determines the most probable next node of the current node according to the transition probability of formula (1) and then performs sampling.
With this sampling approach, a set of possible binary execution paths covering almost all basic blocks is eventually obtained. All basic block sequences generated by the random walks are aggregated into the overall function sequence used for training.
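The random walk over the control flow graph can be sketched as below. The sketch assumes the standard node2vec bias with a return parameter `p` and an in-out parameter `q`; the CFG is represented as a plain adjacency dictionary, and the function names, default walk length, and number of walks per node are hypothetical choices for illustration.

```python
import random


def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """One biased walk over a directed CFG given as {block: [successor, ...]}."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = graph.get(cur, [])
        if not neighbors:
            break                                   # exit block: no successor
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))      # first step is unbiased
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)             # step back to previous block
            elif nxt in graph.get(prev, []):
                weights.append(1.0)                 # stays close to previous block
            else:
                weights.append(1.0 / q)             # moves further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


def generate_sequences(graph, walks_per_node=2, length=5, seed=0):
    """Start several walks from every basic block and collect all sequences."""
    rng = random.Random(seed)
    return [
        node2vec_walk(graph, node, length, rng=rng)
        for _ in range(walks_per_node)
        for node in graph
    ]
```

Starting walks from every node is one simple way to make sure each basic block appears in at least one sequence; the source states only that the resulting paths cover almost all basic blocks.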
Wherein, generating the function feature vector:
after the function sequence is obtained, it is used as the input of the assembly language embedding model, which is trained continuously with Asm2vec. The neighbor instructions of the current instruction are used to capture the semantic relations between opcodes and operands; the function vector is used to memorize information that cannot be predicted from the context; and the embedded function vector is finally obtained through training. In practice, all of the preprocessed assembly code instructions participate in the training so that the extracted information is accurate and effective.
In the training process, the instruction and function vectors are set to 200 dimensions and the sliding window size is set to 3, namely the context instructions and the current instruction being predicted. With these settings, the model trains the function vector by maximizing the log probability in equation (2).
By continuously training the assembly language model, a language model embedding assembly code information is obtained. The language model is intended to be a more accurate and more general semantic information extraction model, so a large number of documents are used for training. After training, the model exists independently of downstream tasks: in a semantic extraction task for new related assembly code, the pre-trained parameters can be used directly without retraining. The semantic information extraction model is used in the subsequent steps to extract the semantic information of control flow graph nodes.
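The sliding window of size 3 described above can be illustrated by how training samples might be assembled from one function sequence. This sketch shows only the sample construction (previous instruction, current instruction, next instruction, plus the owning function); it does not implement Asm2vec training, and the function name and dictionary keys are hypothetical.

```python
def context_windows(function_id, instructions):
    """Build (previous, target, next) samples for one function sequence.

    Each sample pairs the instruction being predicted with its two
    neighbors and with the identifier of the function it belongs to,
    so that a model can learn from both the local context and the
    function-level vector.
    """
    samples = []
    for i in range(1, len(instructions) - 1):
        samples.append({
            "function": function_id,
            "previous": instructions[i - 1],
            "target": instructions[i],
            "next": instructions[i + 1],
        })
    return samples
```

A sequence of four instructions therefore produces two samples, one centered on the second instruction and one on the third.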
Wherein, the generation of the effective information image:
before step 205, i.e., before SPP-net model training is performed, the feature vectors need to be converted into images. Because similar or even identical code logic exists within a malware family, the image textures of similar code segments are highly similar when the segments are converted into images by a visualization method. With this property, once malicious code has been converted into images, it can be classified by family using a deep learning method. However, previous work performed the conversion only on the binary file, which is a string of 0s and 1s and cannot reflect the execution logic inherent in the program at all. The image conversion and classification performed by the embodiment of the invention at the assembly code level, which is closer to the source code, therefore achieves a better classification effect.
The feature vectors are spliced to obtain a 200 × N effective feature image.
Wherein, training and classification with SPP-net:
the core idea of spatial pyramid pooling is to pool each part of the feature map at a corresponding scale. The specific selection process is as follows:
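The splicing step can be sketched as follows. The sketch assumes that each function vector becomes one column of the image and that values are linearly rescaled to 8-bit gray levels; the source states only that the vectors are spliced into a 200 × N image, so the layout and the scaling are illustrative assumptions.

```python
def vectors_to_image(function_vectors, dim=200):
    """Stack N function vectors into a dim x N grayscale image (list of rows)."""
    if not function_vectors:
        raise ValueError("at least one function vector is required")
    for vec in function_vectors:
        if len(vec) != dim:
            raise ValueError("every function vector must have length dim")
    lo = min(min(vec) for vec in function_vectors)
    hi = max(max(vec) for vec in function_vectors)
    span = (hi - lo) or 1.0                 # avoid division by zero
    # Row r holds component r of every function vector, scaled to 0..255.
    return [
        [round(255 * (vec[r] - lo) / span) for vec in function_vectors]
        for r in range(dim)
    ]
```

Because N depends on the sample, the resulting images have different widths, which is why a network that accepts variable-size input is needed downstream.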
a first layer of pyramid, which divides the whole image into 16 blocks, wherein under the condition of the image size (w, h), the size of each extracted image block is (w/4, h/4);
the second layer of pyramid divides the whole picture into 4 blocks, and under the condition of image size (w, h), the size of each extracted image block is (w/2, h/2);
and a third layer of pyramid, which divides the whole picture into a whole block, wherein in the case of the image size (w, h), the extracted image block size is (w, h).
With this selection method, the pooling layer generates feature maps of 4 × 4, 2 × 2, and 1 × 1, which are concatenated into a column vector and connected to the next fully connected layer. This eliminates the effect of inconsistent input scales.
By replacing the pooling layer of the neural network by spatial pyramid pooling, the following advantages can be obtained on the network:
1) the output size of the spatial pyramid pooling is independent of the input size, so an image input of any size produces an output of the same size;
2) the use of multiple windows of different visual sizes on the sample allows for maximum preservation of image features.
In this way, feature maps of different sizes are unified into an output of fixed size, and the feature map of each channel contributes a 21-dimensional feature (16 + 4 + 1) to the output of the spatial pyramid pooling. Finally, the malicious family images are classified by passing this output through a fully connected layer to a softmax classifier.
In a specific implementation, the corresponding image is generated by extracting the features of a malicious sample. Because computer vision techniques perform well at image classification, once the images are classified with SPP-net (a computer vision technique), each malicious sample is automatically classified as well.
During training, the batch size is set to 32, the optimizer is Adam, the learning rate is set to 0.0001, and training runs for a total of 1000 rounds. The data sets used by the malware family classification model designed in the embodiment of the invention are the public data set of the Microsoft challenge competition and a data set labeled by an authoritative virus classification website; corresponding adjustments can be made according to the specific use scenario in actual deployment.
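The three pyramid levels can be sketched for a single channel as below. The sketch assumes max pooling within each bin and bin boundaries computed with floor and ceiling so that every bin is non-empty; both are illustrative choices, and the function name is hypothetical.

```python
import math


def spp_channel(feature_map, levels=(4, 2, 1)):
    """Pool one 2-D feature map into 16 + 4 + 1 = 21 values (max pooling)."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:                        # n x n grid at this pyramid level
        for i in range(n):
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(max(
                    feature_map[r][c]
                    for r in range(r0, r1)
                    for c in range(c0, c1)
                ))
    return pooled
```

Whatever the height and width of the input map, the returned list has 21 entries, so the fully connected layer that follows always receives a vector of the same length per channel.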
Example 3
A classification apparatus for visualizing malware, referring to fig. 2, the apparatus comprising: a processor and a memory, the memory having stored therein program instructions, the processor calling the program instructions stored in the memory to cause the apparatus to perform the steps of:
preprocessing a malicious software data set to form a training data set;
reprocessing the training data set by using the random walk of the graph embedding model to generate a basic block sequence;
adopting assembly code semantics and a control flow graph learning model to learn how to generate a function vector consisting of a basic block sequence;
performing full splicing operation on the function vector to obtain an effective characteristic information image of a malicious sample; and extracting the feature vectors of the effective feature information images by utilizing a spatial pyramid pooling layer, and completing the classification of the feature images by connecting a softmax layer, thereby realizing the classification of the malicious software family.
Wherein the preprocessing comprises: generation of a disassembled file, extraction of sub-processes, conversion of the control structure, and normalization processing.
Further, the step of processing the training data set again by using the random walk of the graph embedding model to generate the basic block sequence specifically includes:
regarding each basic block as a graph node and each jump relationship as an edge of the graph, calculating the most probable next node of the current node by random walk according to the transition probability, and then sampling;
finally, a group of binary execution paths covering all the basic blocks is obtained, and all the basic block sequences generated by random walk are gathered to obtain a total function sequence used for training.
Wherein, learning how to generate a function vector consisting of a basic block sequence by adopting the assembly code semantics and control flow graph learning model is specifically:
the neighbor instruction of the current instruction is used for capturing the semantic relation between the operation code and the operand; the function vector is used for memorizing unpredictable information in the context, and finally training an embedded function vector; all of the preprocessed assembly code instructions participate in the training.
Further, extracting the feature vectors of the effective feature information image by using the spatial pyramid pooling layer and completing the classification of the feature image by connecting a softmax layer is specifically:
a first layer of pyramid, dividing the whole image into 16 blocks, wherein under the condition of image size (w, h), the size of each extracted image block is (w/4, h/4);
the second layer of pyramid divides the whole picture into 4 blocks, and under the condition of image size (w, h), the size of each extracted image block is (w/2, h/2);
and a third layer of pyramid, namely dividing the whole picture into a whole block, wherein in the case of the image size (w, h), the extracted image block size is (w, h).
It should be noted that the device description in the above embodiments corresponds to the method description in the embodiments, and the embodiments of the present invention are not described herein again.
The processor and the memory may be implemented by devices with computing capability, such as a computer, a single-chip microcomputer, or a microcontroller. In a specific implementation, the embodiment of the invention does not limit the implementing device, which is selected according to the requirements of the practical application. Data signals are transmitted between the memory and the processor through a bus, which is not described in detail in the embodiment of the present invention.
In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, but rather as the subject matter of the invention is to be construed in all aspects and as broadly as possible, and all changes, equivalents and modifications that fall within the true spirit and scope of the invention are therefore intended to be embraced therein.
Claims (6)
1. A classification method for visual malware, the method comprising the steps of:
preprocessing a malicious software data set to form a training data set;
the random walk of the graph embedding model is used for reprocessing the training data set to generate a basic block sequence;
adopting assembly code semantics and a control flow graph learning model to learn how to generate a function vector consisting of a basic block sequence;
performing full splicing operation on the function vector to obtain an effective characteristic information image of a malicious sample; and extracting the feature vectors of the effective feature information images by utilizing a spatial pyramid pooling layer, and completing the classification of the feature images by connecting a softmax layer, thereby realizing the classification of the malicious software family.
2. The classification method for visual malware according to claim 1, wherein the preprocessing is: generation of a disassembled file, extraction of a subprocess, conversion of a control structure and normalization processing.
3. The classification method for visual malware according to claim 1, wherein the random walk using the graph embedding model is used to reprocess the training data set, and the generating of the basic block sequence specifically includes:
regarding each basic block as a graph node and each jump relationship as an edge of the graph, calculating the most probable next node of the current node by random walk according to the transition probability, and then sampling;
finally, a group of binary execution paths covering all the basic blocks is obtained, and all the basic block sequences generated by random walk are gathered to obtain a total function sequence used for training.
4. The classification method for visual malware according to claim 1, wherein learning how to generate a function vector consisting of a basic block sequence by adopting the assembly code semantics and control flow graph learning model is specifically:
the neighbor instruction of the current instruction is used for capturing the semantic relation between the operation code and the operand; the function vector is used for memorizing unpredictable information in the context, and finally training an embedded function vector; all of the preprocessed assembly code instructions participate in the training.
5. The classification method for visual malware according to claim 1, wherein extracting the feature vectors of the effective feature information image by using the spatial pyramid pooling layer and completing the classification of the feature image by connecting a softmax layer is specifically:
a first layer of pyramid, dividing the whole image into 16 blocks, wherein under the condition of image size (w, h), the size of each extracted image block is (w/4, h/4);
the second layer of pyramid divides the whole picture into 4 blocks, and under the condition of image size (w, h), the size of each extracted image block is (w/2, h/2);
and a third layer of pyramid, which divides the whole picture into a whole block, wherein in the case of the image size (w, h), the extracted image block size is (w, h).
6. A classification apparatus for visualizing malware, the classification apparatus comprising: a processor and a memory, the memory having stored therein program instructions, the processor invoking the program instructions stored in the memory to cause the apparatus to perform the method steps of any of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210672136.1A CN115098857B (en) | 2022-06-15 | 2022-06-15 | Visual malicious software classification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115098857A true CN115098857A (en) | 2022-09-23 |
CN115098857B CN115098857B (en) | 2024-07-12 |
Family
ID=83291061
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210672136.1A Active CN115098857B (en) | 2022-06-15 | 2022-06-15 | Visual malicious software classification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115098857B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109558735A (en) * | 2018-12-03 | 2019-04-02 | 杭州安恒信息技术股份有限公司 | A kind of rogue program sample clustering method and relevant apparatus based on machine learning |
CN111368304A (en) * | 2020-03-31 | 2020-07-03 | 绿盟科技集团股份有限公司 | Malicious sample category detection method, device and equipment |
CN112948828A (en) * | 2021-01-25 | 2021-06-11 | 厦门服云信息科技有限公司 | Binary program malicious code detection method, terminal device and storage medium |
CN112966271A (en) * | 2021-03-18 | 2021-06-15 | 中山大学 | Malicious software detection method based on graph convolution network |
CN113434858A (en) * | 2021-05-25 | 2021-09-24 | 天津大学 | Malicious software family classification method based on disassembly code structure and semantic features |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116578979A (en) * | 2023-05-15 | 2023-08-11 | 软安科技有限公司 | Cross-platform binary code matching method and system based on code features |
CN116578979B (en) * | 2023-05-15 | 2024-05-31 | 软安科技有限公司 | Cross-platform binary code matching method and system based on code features |
CN117932607A (en) * | 2024-03-20 | 2024-04-26 | 山东省计算中心(国家超级计算济南中心) | Lesu software detection method, system, medium and equipment |
CN117932607B (en) * | 2024-03-20 | 2024-09-24 | 山东省计算中心(国家超级计算济南中心) | Lesu software detection method, system, medium and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN115098857B (en) | 2024-07-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||