CN115098857A - Visual malicious software classification method and device

Info

Publication number: CN115098857A
Application number: CN202210672136.1A
Authority: CN
Inventors: 宁国庆; 刘健
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2022-06-15
Filing date: 2022-06-15
Publication date: 2022-09-23
Anticipated expiration: 2042-06-15
Also published as: CN115098857B

Abstract

The invention discloses a method and a device for classifying visualized malicious software. The method comprises the following steps: preprocessing a malicious software data set to form a training data set; reprocessing the training data set with the random walk of a graph embedding model to generate basic block sequences; using an assembly code semantics and control flow graph learning model to learn function vectors from the basic block sequences; splicing the function vectors in full to obtain an effective feature information image of each malicious sample; and extracting feature vectors from the effective feature information image with a spatial pyramid pooling layer, followed by a softmax layer that classifies the feature image, thereby classifying malware families. The device comprises a processor and a memory. The invention uses SPP-Net (spatial pyramid pooling network) for the final classification of the feature images, thereby improving the detection of visualized malicious software and maintaining the security of the network environment.

Description

Visual malicious software classification method and device

Technical Field

The invention relates to the field of network security, in particular to a method and a device for classifying visual malicious software.

Background

The word "malware" is a combination of the two words "malicious" and "software". According to the definition published by the Internet Society of China, malicious software is software that is installed and run on a user's computer or other terminal without explicit notice to, or authorization from, the user and that infringes the user's legitimate rights and interests, but it does not include computer viruses as specified by national laws and regulations.

With the rapid development of the internet, network security problems have become increasingly prominent. In the field of malicious software in particular, the explosive growth of ransomware has caused huge economic losses to users and enterprises that rely on computer systems. According to SonicWall data, attacks such as Internet of Things malware attacks have grown at a rate of 6% to 19%, whereas ransomware attacks have grown exponentially, increasing by about 232% compared with 2019 and by nearly 319 million compared with 2020.

With the rise of emerging technologies such as cloud computing, the Internet of Things, big data and 5G, malicious software has become more diverse; complex techniques such as encryption, metamorphism and self-protection have appeared, and the number of malicious programs keeps growing, which poses a huge challenge to anti-malware technology. Faster, more accurate anti-malware technologies capable of continuous tracking are therefore needed.

One of the major challenges faced by anti-malware technology is the need to evaluate the potential malicious intent and execution behavior of a large volume of data and files. A main reason for the large number of distinct files is that, to evade detection, malware developers introduce polymorphism into the malicious components: malicious files belonging to the same malware "family" exhibit the same form of malicious behavior but are continually modified or obfuscated with various strategies, so that they look like many different files.

Thus, the characteristic of "family" can be utilized to classify malware, and to better understand the way malware infects computers and devices, the level of threat posed, and the corresponding precautions.

Complete data sets for training malware classification models have been proposed in prior work. One example, used hereinafter to explain the present work, is the data set of the malware family classification challenge published by Microsoft on the Kaggle website in 2015. The data set contains a total of 10868 malware samples from nine families. The PE (portable executable) header of each file has been removed by preprocessing, and for each sample the data set provides the disassembly result (.asm file) and the binary representation (.byte file), together with a label indicating the family to which the sample belongs.

Based on this data set, a number of methods have been created for family classification of malware. One example is converting the binary file into an image. This approach was originally proposed by Nataraj, Karthikeyan et al. in 2011: a binary file is displayed as a gray-scale image, and malicious files are classified using the GIST (global feature) texture features of the image together with the KNN (K-nearest neighbor) algorithm. Many researchers have since extended this idea. In China, a typical example is the work of Han et al., which extracts features from the gray-scale image with a gray-level co-occurrence matrix algorithm, selects the features with the largest weights with a feature selection algorithm, and builds a texture feature library after normalization. Another example is the assembly code detection method: in 2018, Guosong Sun et al. vectorized opcodes, merged static analysis and assembly code (extracted from unlabeled data) with temporal features generated by an RNN (recurrent neural network), processed the merged codes with minhash to generate feature images, and finally trained a CNN (convolutional neural network) on the feature images for classification.

The assembly code detection method already uses assembly opcodes for malicious software detection and classification, but it considers only the order in which the opcodes appear, thereby losing the semantic connections between them.

Disclosure of Invention

The invention provides a method and a device for classifying visualized malicious software, which identify the family of malicious software accurately and quickly. The invention applies disassembly technology and deep learning algorithms to the field of malicious software analysis and provides a complete malware family classification workflow: the graph embedding model Node2vecWalk and an assembly language model are used to integrate the sequence semantic information and the structural information of the PE file obtained by disassembly, so as to obtain a feature image of the file, and SPP-Net (spatial pyramid pooling network) is used for the final classification, thereby improving the detection of visualized malicious software and maintaining the security of the network environment. The details are as follows:

in a first aspect, a classification method for visualized malicious software comprises the following steps:

preprocessing a malicious software data set to form a training data set;

reprocessing the training data set by using the random walk of the graph embedding model to generate a basic block sequence;

adopting assembly code semantics and a control flow graph learning model to learn how to generate a function vector consisting of a basic block sequence;

performing full splicing operation on the function vector to obtain an effective characteristic information image of a malicious sample; and extracting the feature vectors of the effective feature information images by utilizing the spatial pyramid pooling layer, and completing the classification of the feature images by connecting a softmax layer, thereby realizing the classification of the malware families.

Wherein the preprocessing comprises: generating disassembled files, extracting subroutines, converting control structures, and normalization.

Further, reprocessing the training data set by using the random walk of the graph embedding model to generate the basic block sequence specifically includes:

regarding each basic block as a graph node and each jump relation as an edge of the graph, the random walk determines the most probable next node of the current node according to the transition probability and samples accordingly;

finally, a group of binary execution paths covering all the basic blocks is obtained, and all basic block sequences generated by the random walk are aggregated to obtain the total function sequence used for training.

Learning to generate the function vector from the basic block sequence by adopting the assembly code semantics and control flow graph learning model includes:

the neighbor instructions of the current instruction are used to capture the semantic relation between opcodes and operands; the function vector memorizes the information that cannot be predicted from the context, and the embedded function vector is finally obtained by training; all of the preprocessed assembly code instructions participate in the training.

Further, extracting the feature vector of the effective feature information image by using the spatial pyramid pooling layer and completing the classification of the feature image by connecting a softmax layer is specifically:

a first layer of pyramid, dividing the whole image into 16 blocks, wherein under the condition of image size (w, h), the size of each extracted image block is (w/4, h/4);

the second layer of pyramid divides the whole picture into 4 blocks, and under the condition of image size (w, h), the size of each extracted image block is (w/2, h/2);

and a third layer of pyramid, which divides the whole picture into a whole block, wherein in the case of the image size (w, h), the extracted image block size is (w, h).

In a second aspect, a classification apparatus for visualizing malware includes: a processor and a memory, the memory having stored therein program instructions, the processor calling the program instructions stored in the memory to cause the apparatus to perform the method steps of the first aspect.

The technical scheme provided by the invention has the beneficial effects that:

1. the invention integrates the assembly operation code, the operand semantics and the program logic structure information, combines the processing technology of computer vision, and more accurately and quickly classifies malicious families;

2. the invention combines the semantic features of the assembly code sequence, which is closest to the source code, with the structural features of the control flow graph as the basis for malware family classification. The assembly code semantics represent the behavior of the code blocks, and the control flow graph represents the execution order of the program; together, these two kinds of information form the operating logic of the program. Analyzing both allows the runtime behavior of the program represented by a binary file to be characterized more accurately, which improves prediction precision; when tested on the data set mentioned above, the invention achieves very high detection accuracy compared with other methods;

3. the invention uses an assembly language model to process code sequences. Assembly functions may look different syntactically while sharing similar functional logic in the source code; the method captures this similar logic to learn richer and more general semantic information among assembly codes, and these semantic features faithfully reflect the program meaning of the assembly code. For the assembly language model, the size of the training set determines the accuracy of the semantic information, that is, how accurately the vector representation captures each assembly code sequence; on the data set mentioned above, the model obtains a prediction accuracy on data outside the training set that is very close to that on the training set, which shows that the method generalizes well to new data;

4. fusing the sequence semantic information with the structural information gives the whole scheme better interpretability and a more intuitive direction for optimization. As mentioned above, current schemes mostly treat the classification of malicious files as a black-box process and do not consider the internal logic of the program represented by the binary file, so it is difficult to explain why a good result is obtained. The invention starts from the execution process of the binary file and distinguishes samples by analyzing the characteristics of the program's execution logic, so that a practitioner can understand more intuitively why the invention is effective, which also helps its subsequent optimization;

5. the invention benefits from the rapid development of computer vision technology: it classifies and visualizes malware families, further improves classification accuracy, and enables non-professionals to understand and identify different malicious software, thereby improving the detection of visualized malicious software and maintaining the security of the network environment.

Drawings

FIG. 1 is a schematic diagram of a node walk network graph structure;

FIG. 2 is a flow chart of a method of classifying visual malware;

FIG. 3 is a control flow graph of a function;

FIG. 4 is a schematic illustration of a control flow graph serialization and normalization process;

FIG. 5 is a schematic diagram of function vector generation;

FIG. 6 is a schematic diagram of training the spatial pyramid model.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.

The embodiment of the invention specifically relates to the following technologies:

I. Static program analysis technology

Static analysis involves a series of operations on a binary file: identifying its file type and programming language, determining whether it is encrypted or packed, and obtaining its assembly code. Of these operations, obtaining the assembly code is the most critical. The assembly code reflects the execution logic of the program to a certain extent and plays a vital role in analyzing malicious code. The professional disassembly tool IDA Pro can be used to obtain the assembly code of a binary file together with the control flow graph (CFG) corresponding to each assembly function. The CFG is used in the first step of the classification model.

II. Random walk Node2vecWalk of the graph embedding model

Node2vecWalk is a graph embedding method that takes both the DFS (depth-first search) neighborhood and the BFS (breadth-first search) neighborhood into account. BFS reaches the immediate neighbors of each node and emphasizes a local, microscopic view; DFS explores the network structure around a node in depth and emphasizes a global, macroscopic view. The embodiment of the invention uses the first two steps of the algorithm: a. preprocessing and calculating the transition probabilities; b. biased second-order random walk.

Through the two steps, the sequence set of the graph nodes can be efficiently and quickly sampled, and the sequence set can be used for generating function vectors in the model.

The basic formulas of steps a and b are as follows:

$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z}, & (v, x) \in E \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

wherein $P$ represents the probability that the walk moves from node $v$ to node $x$, $c_i$ is the $i$-th node of the walk, $E$ is the edge set of the graph, $\pi_{vx}$ is the transition probability before normalization, and $Z$ is the normalization constant.

$$\pi_{vx} = \alpha_{pq}(t, x) \cdot \omega_{vx} \qquad (2)$$

wherein $\omega_{vx}$ represents the weight of the edge between nodes $v$ and $x$; if such an edge exists, the unnormalized transition probability $\pi_{vx}$ is updated accordingly.

$$\alpha_{pq}(t, x) = \begin{cases} \dfrac{1}{p}, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ \dfrac{1}{q}, & d_{tx} = 2 \end{cases}$$

wherein $\alpha_{pq}(t, x)$ expresses the relationship between the neighbors of $v$ and the previous node $t$: $d_{tx} = 0$ means walking from $v$ back to the starting point $t$; $d_{tx} = 1$ means that the neighbor node of $v$ also has an edge to the starting point $t$; $d_{tx} = 2$ means that the neighbor node of $v$ has no edge to the starting point $t$. The parameter $p$ controls the probability of returning from node $v$ to the starting point $t$ during sampling (the return probability): the larger $p$ is, the smaller the return probability. The parameter $q$ controls the probability of moving along the path t-v-new node during sampling: if $q > 1$, the walk is biased toward accessing local nodes (BFS), because $1 > 1/q$; if $q < 1$, the walk is biased toward accessing global nodes (DFS), because $1 < 1/q$. The resulting node walk network graph structure is shown in fig. 1.
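The transition rule above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the graph representation (a dictionary mapping each node to its weighted successors), the function names `alpha` and `biased_walk`, and the toy control flow graph are all assumptions made for illustration. The normalization constant Z is applied implicitly by the weighted sampling call.

```python
import random

def alpha(p, q, t, x, graph):
    """Search bias alpha_pq(t, x) for a step v -> x when the walk arrived at v from t."""
    if x == t:            # d_tx = 0: step back to the previous node
        return 1.0 / p
    if x in graph[t]:     # d_tx = 1: x is also adjacent to t
        return 1.0
    return 1.0 / q        # d_tx = 2: x has no edge to t

def biased_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Sample one biased second-order walk of at most `length` nodes."""
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        v = walk[-1]
        neighbors = sorted(graph[v])
        if not neighbors:          # e.g. a returning basic block: the path ends here
            break
        if len(walk) == 1:         # first step: no previous node, use edge weights only
            weights = [graph[v][x] for x in neighbors]
        else:
            t = walk[-2]
            weights = [alpha(p, q, t, x, graph) * graph[v][x] for x in neighbors]
        # random.choices normalizes the weights, which plays the role of Z
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

# Hypothetical control flow graph: B0 branches to B1 or B2, both continue to B3.
cfg = {"B0": {"B1": 1.0, "B2": 1.0}, "B1": {"B3": 1.0}, "B2": {"B3": 1.0}, "B3": {}}
print(biased_walk(cfg, "B0", length=10, p=2.0, q=0.5))
```

Repeating such walks from the entry block with different random seeds yields the set of basic block sequences that is later aggregated into the function sequence.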

III. Assembly language model Asm2vec

The assembly language model extends the natural language processing model PV-DM. PV-DM is similar to CBOW in word2vec: a missing word is predicted from its context. The word vectors of the context and the vector of the current paragraph are fused, a softmax layer produces a prediction over the words in the vocabulary, the loss is computed against the label, and the gradients are back-propagated to update the parameters. In the end, both word vectors and paragraph vectors are obtained. In Asm2vec, the feature vector is generated from the assembly code sequence, the assembly function itself and a sliding window: the model learns the semantics between opcodes and operands and memorizes, in the vector of the assembly function, a weighted mixture of the contexts it contains.

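The model involves the following objective function, which is maximized during training:

$$\sum_{f_s \in RP} \; \sum_{seq_i \in S(f_s)} \; \sum_{in_j \in I(seq_i)} \; \sum_{t_c \in T(in_j)} \log P\left(t_c \mid f_s, in_{j-1}, in_{j+1}\right)$$

wherein,

Symbol	Meaning
∑	Summation
RP	Function set
f_s	Any one function
S(f_s)	Sequence set of the function
seq_i	Any one of the sequences
I(seq_i)	Instruction set of the sequence
in_j	One of the instructions
T(in_j)	Tokens of the instruction (concatenation of its opcode and operands)
t_c	One of the tokens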
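To make the innermost term of this objective concrete, the following Python sketch computes log P(t_c | f_s, in_{j-1}, in_{j+1}) summed over the tokens of one instruction. It is a simplified illustration under several assumptions that are not stated in the source: an instruction is represented by its opcode embedding concatenated with the mean of its operand embeddings, the context is the average of the function vector and the two neighbor instruction vectors, and a full softmax over the vocabulary replaces any sampling-based approximation. The names `instruction_vector`, `log_prob_instruction`, `emb` and `out_emb` are hypothetical.

```python
import math

def instruction_vector(instr, emb):
    """Opcode embedding concatenated with the mean operand embedding (zeros if no operands)."""
    opcode, operands = instr[0], instr[1:]
    d = len(emb[opcode])
    if operands:
        operand_part = [sum(emb[o][k] for o in operands) / len(operands) for k in range(d)]
    else:
        operand_part = [0.0] * d
    return list(emb[opcode]) + operand_part

def log_prob_instruction(func_vec, prev_instr, cur_instr, next_instr, emb, out_emb):
    """Sum over tokens t_c of cur_instr of log P(t_c | f_s, in_{j-1}, in_{j+1})."""
    prev_vec = instruction_vector(prev_instr, emb)
    next_vec = instruction_vector(next_instr, emb)
    # Context: average of the function vector and the two neighbor instruction vectors.
    context = [(f + a + b) / 3.0 for f, a, b in zip(func_vec, prev_vec, next_vec)]
    # Score every vocabulary token against the context, then apply a stable log-softmax.
    scores = {t: sum(w * c for w, c in zip(vec, context)) for t, vec in out_emb.items()}
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return sum(scores[t] - log_z for t in cur_instr)
```

With a sliding window of three instructions, training would sum this quantity over every instruction of every sequence of every function and adjust the function vector and the token embeddings to increase it.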

IV. Deep learning model SPP-Net

SPP-Net, the spatial pyramid pooling network, is a neural network model for image classification. Its key property is that a feature map of any size can be converted into a feature vector of fixed size (fixed-size features are extracted at multiple scales), which is then fed to a fully connected layer for classification. The overall framework is roughly: input image, feature extraction by convolutional layers, fixed-size feature extraction by spatial pyramid pooling, and fully connected layers.

The basic principle is as follows: the feature map is divided into 21 blocks (16 + 4 + 1), and the maximum value of each block is computed to give one output neuron. In this way, a picture of arbitrary size is converted into a fixed-size 21-dimensional feature.

The kernel size, stride and padding of each pooling window are obtained with the rounding functions ceil() (round up) and floor() (round down) from the following quantities:

Symbol	Meaning
h_in	Height of the input
w_in	Width of the input
K_h	Height of the kernel
S_h	Stride in the height direction
P_h	Padding in the height direction (applied on both sides, so it is multiplied by 2)
K_w	Width of the kernel
S_w	Stride in the width direction
P_w	Padding in the width direction (applied on both sides, so it is multiplied by 2)
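The fixed-size output of spatial pyramid pooling can be sketched in a few lines of Python. This is only an illustration for a single-channel feature map stored as a list of rows; the function name `spp_max_pool` is hypothetical, and the bin boundaries (floor for the start of a bin, ceil for its end, without padding) are an assumption chosen so that every bin is non-empty, not the exact kernel, stride and padding rule of the embodiment.

```python
def spp_max_pool(feature_map, levels=(4, 2, 1)):
    """Max-pool a 2-D feature map into n x n bins per level and concatenate the results."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            r0 = (i * h) // n                    # floor(i * h / n)
            r1 = ((i + 1) * h + n - 1) // n      # ceil((i + 1) * h / n)
            for j in range(n):
                c0 = (j * w) // n
                c1 = ((j + 1) * w + n - 1) // n
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled  # 16 + 4 + 1 = 21 values for the default levels

# Two inputs of different sizes yield vectors of the same length.
small = [[r * 5 + c for c in range(5)] for r in range(7)]
large = [[(r * c) % 11 for c in range(13)] for r in range(200)]
print(len(spp_max_pool(small)), len(spp_max_pool(large)))
```

In a multi-channel network the same pooling would be applied to every channel, so each channel contributes 21 values to the vector passed to the fully connected layer.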

Example 1

The malware family classification method based on the disassembled code structure and semantic features comprises the following steps:

101: analyzing the binary file with a disassembling tool to obtain its assembly code representation and saving it as an asm file; parsing the asm file and delimiting basic blocks by jump instructions and end instructions; and generating the corresponding control flow graph according to the logic structure, wherein each node of the control flow graph consists of one or more assembly statements and each edge represents a jump relation between nodes;

102: walking the control flow graph of each function with the Node2vecWalk model to generate a plurality of biased execution path sequences, and then normalizing the operands in the path sequences, specifically by replacing registers, memory addresses and numerical values;

because the specific operand values carry little weight in the semantic relation, this replacement optimizes the processing of the assembly language model Asm2vec.

103: vectorizing the obtained function sequences with Asm2vec, splicing the feature vectors to obtain a feature image, and finally classifying the feature image with the deep learning algorithm SPP-Net, thereby classifying the malicious software.

In summary, in the embodiments of the present invention, the vector representation of the control flow graph and the semantic information obtained in the above steps is used to classify the binary files by using an image classification technology, so as to determine the family of the malware.

Example 2

Referring to fig. 2, an embodiment of the present invention provides a method for classifying malware families based on the disassembled code structure and semantic features. Classifying malware families with this method improves classification accuracy and the detection rate of visualized malware and maintains the security of the network environment. The embodiment is implemented as follows:

step 201: preprocessing a malware data set, including generating disassembled files, extracting subroutines, converting control structures, normalization and the like, to form a training data set;

step 202: further processing the training data set obtained in step 201 by using a random walk Node2vecWalk of the graph embedding model, referring to the walk strategy of fig. 1, to generate a basic block sequence;

step 203: for the basic block sequence generated in the step 202, adopting an assembly code semantic and control flow graph learning model Asm2vec to learn how to generate a function vector consisting of the basic block sequence;

step 204: performing full splicing operation on the function vector in the step 203 to further obtain an effective characteristic information image of a malicious sample;

step 205: for the effective information image obtained in step 204, a Spatial Pyramid Pooling (SPP) layer is used to extract a feature vector of the image, and a softmax layer is further connected to complete classification of the feature image, so that classification of a malware family is achieved.

The data collection and preprocessing stage is as follows:

In the field of malware families, there is no strict standard for naming and classifying families; network security vendors name families according to their own conventions when they discover new malicious samples, and these names gradually gain general acceptance. Therefore, in an actual application scenario, the concrete problem should be considered, adjustments made accordingly, and an appropriate data set selected for classification prediction.

In view of this, embodiments of the present invention use a recognized data set as well as a data set verified by the authoritative service VirusTotal for experimental evaluation. The recognized data set is the annotated data set from Microsoft, which provides two types of files for each sample, namely an asm file (disassembled file) and a byte file (binary file), and also gives the family label of each sample. This family label is used for the subsequent family classification.

The method is validated on asm files but is not limited to the present data set (for example, it also applies to the latest malicious sample data set provided by VirusTotal). Since the data set provided by Microsoft already contains asm files, only subroutine extraction, control structure conversion, normalization and similar processing are needed. For the data set provided by VirusTotal, the asm files cannot be obtained directly, so the binary files are disassembled with a tool such as IDA Pro to generate the corresponding asm files; the subsequent assembly code preprocessing is the same as above.

For the classification of a new malicious sample, the first step must be processing with a disassembling tool.

Subroutine extraction captures all code blocks delimited by SUBROUTINE in the asm file; control structure conversion divides the captured code blocks at instructions such as jump and end instructions; normalization consists of replacing registers, memory addresses and numerical values.
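The control structure conversion and normalization described above can be sketched as follows. The sketch is illustrative only: the register list, the set of block-ending mnemonics, the placeholder tokens `REG`, `MEM` and `IMM`, and the function names are assumptions, and real disassembler output would need a more complete parser (for example, starting a new block at every jump target as well).

```python
import re

# Hypothetical, deliberately incomplete lists for 32-bit x86.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}
BLOCK_END = {"jmp", "jz", "jnz", "je", "jne", "ret", "retn"}
IMMEDIATE = re.compile(r"(0x[0-9a-f]+|[0-9][0-9a-f]*h|\d+)")

def normalize_operand(op):
    """Replace registers, memory references and numeric values by placeholder tokens."""
    op = op.strip().lower()
    if op in REGISTERS:
        return "REG"
    if "[" in op:                      # memory reference such as [ebp+8]
        return "MEM"
    if IMMEDIATE.fullmatch(op):        # decimal or hexadecimal constant
        return "IMM"
    return op                          # labels and symbols are kept unchanged

def normalize_instruction(line):
    """Turn 'mov eax, [ebp+8]' into ('mov', 'REG', 'MEM')."""
    opcode, _, rest = line.strip().partition(" ")
    operands = [normalize_operand(o) for o in rest.split(",") if o.strip()]
    return (opcode.lower(), *operands)

def split_basic_blocks(lines):
    """Close the current block after every jump or end (return) instruction."""
    blocks, current = [], []
    for line in lines:
        instr = normalize_instruction(line)
        current.append(instr)
        if instr[0] in BLOCK_END:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

subroutine = ["mov eax, [ebp+8]", "cmp eax, 0Ah", "jz loc_401020",
              "push 1", "call sub_401000", "retn"]
for block in split_basic_blocks(subroutine):
    print(block)
```

Each resulting block would become one node of the control flow graph, with edges added for the jump and fall-through relations between blocks.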

The basic block sequence is generated as follows:

the control flow graph (CFG) extracted in step 201 is processed by Node2vecWalk to generate basic block sequences; see fig. 4 for one such sequence. In this step, basic blocks are regarded as graph nodes and jump relations as edges of the graph. Node2vecWalk determines the most probable next node of the current node according to the transition probability of formula (1) and samples accordingly.

With this sampling approach, a set of possible binary execution paths covering almost all basic blocks is eventually obtained. All basic block sequences generated by the random walk are aggregated to obtain the total function sequence used for training.

The function feature vector is generated as follows:

after the function sequence is obtained, it is used as the input of the assembly language embedding model and learned by Asm2vec, wherein the neighbor instructions of the current instruction are used to capture the semantic relations between opcodes and operands, and the function vector memorizes the information that cannot be predicted from the context; the embedded function vector is finally obtained by training. In practice, all of the preprocessed assembly code instructions participate in the training so that the extracted information is accurate and effective.

In the training process, the instruction and function vectors are set to 200 dimensions and the sliding window size is set to 3, namely the context instructions and the instruction currently being predicted. With these settings, the model trains the function vectors by maximizing the log-probability objective of the Asm2vec model described above.

By continuously training the assembly language model, a language model in which assembly code information is embedded is obtained. The aim is a more accurate and more general semantic information extraction model, so a large number of files are used for training. Once trained, the model exists independently of downstream tasks: in a semantic extraction task on new, related assembly code, the pre-trained parameters can be used directly without retraining. The semantic information extraction model is used in the subsequent steps to extract the semantic information of control flow graph nodes.

The effective information image is generated as follows:

before step 205, that is, before SPP-Net model training, the feature vectors need to be converted into images. Because similar or even identical code logic exists within a malware family, the image textures of similar code segments are highly similar once the segments are converted into images by a visualization method. Thanks to this property, after malicious code is converted into images, it can be classified by family with a deep learning method. In previous work, however, the conversion was performed only on the binary file, which is merely a string of 0s and 1s and does not reflect the inherent execution logic of the program. The image conversion and classification performed by the embodiment of the invention at the assembly code level, which is closer to the source code, therefore achieves a better classification effect.

The feature vectors are spliced to obtain an effective feature image of size 200 × N.
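The splicing step can be illustrated with the short Python sketch below. It assumes that N is the number of function vectors of a sample, that each vector becomes one column of the image, and that values are mapped to 8-bit gray levels by min-max scaling; none of these details is specified in the source, and the function name `vectors_to_image` is hypothetical.

```python
def vectors_to_image(function_vectors):
    """Stack N function vectors of equal dimension into a dim x N gray-scale image."""
    dim = len(function_vectors[0])
    if any(len(v) != dim for v in function_vectors):
        raise ValueError("all function vectors must have the same dimension")
    lo = min(min(v) for v in function_vectors)
    hi = max(max(v) for v in function_vectors)
    # Assumed min-max scaling to the range 0..255; a constant input maps to all zeros.
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    # Row k of the image holds component k of every function vector.
    return [[round((v[k] - lo) * scale) for v in function_vectors] for k in range(dim)]

# Three hypothetical 200-dimensional function vectors give a 200 x 3 image.
vectors = [[-1.0] * 200, [1.0] * 200, [0.25] * 200]
image = vectors_to_image(vectors)
print(len(image), len(image[0]))
```

Because N varies from sample to sample, the resulting images have different widths, which is why a pooling layer that accepts inputs of arbitrary size is used for classification.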

SPP-Net is used for training and classification as follows:

the core idea of spatial pyramid pooling is to pool each part of the feature map at a corresponding scale. The specific selection process is as follows:

the first pyramid layer divides the whole image into 16 blocks; for an image of size (w, h), each extracted block has size (w/4, h/4);

the second pyramid layer divides the whole image into 4 blocks; for an image of size (w, h), each extracted block has size (w/2, h/2);

the third pyramid layer treats the whole image as a single block; for an image of size (w, h), the extracted block has size (w, h).

With this selection, the pooling layer produces feature maps of 4 × 4, 2 × 2 and 1 × 1, which are concatenated into a column vector and passed to the next fully connected layer. This eliminates the effect of inconsistent input scales.

Replacing the pooling layer of the neural network with spatial pyramid pooling gives the network the following advantages:

1) the output size of spatial pyramid pooling is independent of the input size, so an image of any size produces an output of the same size;

2) using multiple windows of different sizes on the sample preserves image features as far as possible.

In this way, feature maps of different sizes are unified into outputs of the same size, and the feature map of each channel contributes 21 dimensions to the output of the spatial pyramid pooling. Finally, the malicious family images are classified by a fully connected layer followed by a softmax classifier.

In a specific implementation, one image is generated from the extracted features of each malicious sample. Because computer vision techniques are highly effective at image classification, once SPP-Net (a computer vision technique) has classified the images, each malicious sample has also been classified automatically.

During training, the batch size is set to 32, the optimizer is Adam, the learning rate is set to 0.0001, and training runs for a total of 1000 rounds. The data sets used by the malware family classification model of the embodiment are the public Microsoft challenge data set and a data set verified by an authoritative virus classification website; corresponding adjustments can be made according to the specific usage scenario in actual deployment.

Example 3

A classification apparatus for visualizing malware, referring to fig. 2, the apparatus comprising: a processor and a memory, the memory having stored therein program instructions, the processor calling the program instructions stored in the memory to cause the apparatus to perform the steps of:

preprocessing a malicious software data set to form a training data set;

performing full splicing operation on the function vector to obtain an effective characteristic information image of a malicious sample; and extracting the feature vectors of the effective feature information images by utilizing a spatial pyramid pooling layer, and completing the classification of the feature images by connecting a softmax layer, thereby realizing the classification of the malicious software family.

The basic blocks are regarded as graph nodes and the jump relations as edges of the graph; the random walk determines the most probable next node of the current node according to the transition probabilities, and sampling then proceeds:

Finally, a group of binary execution paths covering all the basic blocks is obtained, and all the basic block sequences generated by the random walk are gathered to form the total function sequence used for training.
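The random-walk sampling described above can be sketched as follows. The sketch assumes that the control flow graph is given as a dictionary mapping each basic block to its successors and their transition probabilities, that every walk starts at the entry block, and that sampling stops once every block has been visited; the names `random_walk` and `sample_until_covered`, the walk length, and the stopping rule are illustrative and not taken from the embodiment.

```python
import random

def random_walk(cfg, start, length, rng):
    """One walk: at each step, sample a successor by its transition probability."""
    path, node = [start], start
    for _ in range(length - 1):
        succ = cfg.get(node, {})
        if not succ:
            break  # exit block: no outgoing jump
        nodes = list(succ)
        node = rng.choices(nodes, weights=[succ[n] for n in nodes], k=1)[0]
        path.append(node)
    return path

def sample_until_covered(cfg, entry, length=8, seed=0, max_walks=10_000):
    """Collect walks from the entry block until all basic blocks are covered."""
    rng = random.Random(seed)
    blocks = set(cfg) | {n for succ in cfg.values() for n in succ}
    walks, seen = [], set()
    while seen != blocks and len(walks) < max_walks:
        walk = random_walk(cfg, entry, length, rng)
        walks.append(walk)
        seen.update(walk)
    return walks

# Hypothetical control flow graph with a branch at B0 that rejoins at B3.
toy_cfg = {
    "B0": {"B1": 0.5, "B2": 0.5},
    "B1": {"B3": 1.0},
    "B2": {"B3": 1.0},
    "B3": {},
}
for walk in sample_until_covered(toy_cfg, "B0"):
    print(" -> ".join(walk))
```

The `max_walks` bound keeps the loop finite when some block cannot be reached from the entry block; the gathered walks play the role of the basic block sequences used for training.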

Using the assembly code semantics and control flow graph learning model to learn how to generate a function vector composed of basic block sequences specifically comprises:

Further, extracting the feature vectors of the effective feature information images using the spatial pyramid pooling layer and completing the classification of the feature images by connecting a softmax layer specifically comprises:

the third pyramid level treats the whole image as a single block: for an image of size (w, h), the extracted image block has size (w, h).

It should be noted that the apparatus description in the above embodiment corresponds to the method description in the preceding embodiments, and details are not repeated here.

The processor and the memory may be implemented by any device with computing capability, such as a computer, a single-chip microcomputer, or a microcontroller. The embodiment of the invention does not limit the specific implementation, which is selected according to the requirements of the practical application. Data signals are transmitted between the memory and the processor over a bus, which is not described in detail in the embodiment of the invention.

In the embodiment of the present invention, unless the model of a device is specifically described, the models of the devices are not limited, as long as the devices can perform the above functions.

Those skilled in the art will appreciate that the drawings are only schematic illustrations of a preferred embodiment, and that the numbering of the above embodiments of the present invention is for description only and does not indicate the relative merits of the embodiments.

The above description covers only preferred embodiments of the present invention and is not intended to limit the invention. Any modifications, equivalent replacements, and improvements made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A classification method for visual malware, the method comprising the steps of:

preprocessing a malicious software data set to form a training data set;

reprocessing the training data set using the random walk of a graph embedding model to generate a basic block sequence;

2. The classification method for visual malware according to claim 1, wherein the preprocessing comprises: generation of a disassembled file, extraction of subprocesses, conversion of control structures, and normalization processing.

3. The classification method for visual malware according to claim 1, wherein reprocessing the training data set using the random walk of the graph embedding model to generate the basic block sequence specifically comprises:

4. The classification method for visual malware according to claim 1, wherein using the assembly code semantics and control flow graph learning model to learn how to generate the function vector composed of the basic block sequence specifically comprises:

5. The classification method for visual malware according to claim 1, wherein extracting the feature vectors of the effective feature information images using the spatial pyramid pooling layer and completing the classification of the feature images by connecting a softmax layer specifically comprises:

6. A classification apparatus for visual malware, the classification apparatus comprising: a processor and a memory, the memory storing program instructions, and the processor invoking the program instructions stored in the memory to cause the apparatus to perform the method steps of any one of claims 1-5.